Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a configurable SVID rotation threshold #4599

Merged
merged 1 commit into from
Dec 15, 2023

Conversation

hiyosi
Copy link
Contributor

@hiyosi hiyosi commented Oct 23, 2023

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality

svid rotation in agent

Description of change

introduce a availability_target configurable to control SVID rotation periold.

Which issue this PR fixes

fixes #4115

@hiyosi hiyosi force-pushed the configurable-svid-rotation branch from aabc08f to e6acdea Compare October 23, 2023 01:40
"github.com/spiffe/spire/pkg/agent/client"
)

const (
gracePeriodThreshold = 12 * time.Hour
Copy link
Contributor Author

@hiyosi hiyosi Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can tune this value along with minimum number of the availability_target.

@@ -55,6 +55,8 @@ const (

bundleFormatPEM = "pem"
bundleFormatSPIFFE = "spiffe"

minimumAvailabilityTarget = 24 * time.Hour
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can tune this value.

@hiyosi hiyosi force-pushed the configurable-svid-rotation branch from e6acdea to f487182 Compare October 23, 2023 01:55
@hiyosi hiyosi marked this pull request as ready for review October 23, 2023 04:29
@hiyosi hiyosi marked this pull request as draft November 7, 2023 01:17

return ttl <= jitteredHalfLife
return ttl <= availabilityTarget
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think, should we introduce the jitter for availability target ?
IMO:

  • ±10% jitter is too large and may shift the availability target.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think this may have a jitter, this can even simplify your code here too, because you can get halftime or your availabilityTarget as initial step for the shitter.
if you are worried about 10%, you can reduce that amount when availabilityTarget is set,
what value do you think is good enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that we should preserve the jitter (with a reduced percentage in this case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since availability_target is more than 24 hours, what about the using absolute maximum jitter instead of percentage? For example, 0 ~ +10min.

https://github.com/spiffe/spire/pull/4599/files#diff-cfe3084f16541101db16f11e483a339dcfecd43d30e7389b7e9ddb80537e8cccR114

@hiyosi hiyosi marked this pull request as ready for review November 7, 2023 02:12
Copy link
Member

@evan2645 evan2645 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple high level comments on docs just to make sure the functionality is clearly communicated ... let me know what you think on those. @amartinezfayo is going to help out reviewing some of the renewal logic changes, so a couple more comments to come. Thanks so much for the contribution @hiyosi 🙏

@@ -68,6 +68,7 @@ This may be useful for templating configuration files, for example across differ
| `trust_bundle_format` | Format of the initial trust bundle, pem or spiffe | pem |
| `trust_domain` | The trust domain that this agent belongs to (should be no more than 255 characters) | |
| `workload_x509_svid_key_type` | The workload X509 SVID key type &lt;rsa-2048&vert;ec-p256&gt; | ec-p256 |
| `availability_target` | The amount of time to guarantee the SVID availability. Must be grater than 24h. | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `availability_target` | The amount of time to guarantee the SVID availability. Must be grater than 24h. | |
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively SVIDs should be rotated. Must be greater than 24h. | |

@@ -105,6 +106,13 @@ These are the available profiles that can be set in the `profiling_freq` configu
- `trace`
- `cpu`

### Availability Target

If the `availability_target` is set, the agent will rotate when remaining lifetime of the SVID reaches the `availability_target`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If the `availability_target` is set, the agent will rotate when remaining lifetime of the SVID reaches the `availability_target`.
If the `availability_target` is set, the agent will rotate an SVID when its remaining lifetime reaches the `availability_target`.

@hiyosi hiyosi force-pushed the configurable-svid-rotation branch 2 times, most recently from 486084d to f948911 Compare November 16, 2023 04:42
@@ -144,7 +144,8 @@ func (r *rotator) rotateSVIDIfNeeded(ctx context.Context) (err error) {
return fmt.Errorf("unexpected value type: %T", r.state.Value())
}

if rotationutil.ShouldRotateX509(r.clk.Now(), state.SVID[0]) {
log := r.c.Log.WithField(telemetry.SPIFFEID, state.SVID[0].URIs[0].String())
if rotationutil.ShouldRotateX509(r.clk.Now(), state.SVID[0], r.c.AvailabilityTarget, log) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so should rotate will be affected by log, and that log can have different fields, for example here you added SPIFFEID but in updateCache this will contains entryID and spiffeID,
maybe we can unify? just to be consistent in what we are logging?
and can we quaranty that state.SVID[0].URIs[0].String() will return the same string we have in updateCache?
in case of x509 you can get the spiffeID from certificate itself and add the fields when required in the function itselft

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored some of them and would be happy to review them again.


cachedSVID, ok := m.cache.GetJWTSVID(spiffeID, audience)
if ok && !rotationutil.JWTSVIDExpiresSoon(cachedSVID, now) {
if ok && !rotationutil.JWTSVIDExpiresSoon(cachedSVID, now, m.c.AvailabilityTarget, log) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is the same for all uses of JWTSVIDExpiresSoon or rotationutil.ShouldRotateX509,
since it depends of some metadata from the caller (introduced in logs)
can you test that those functions are getting the values we expect?
maybe it is better to just provide an spiffeID as parameter and add the field in the function itself?

@@ -40,6 +40,8 @@ type RotatorConfig struct {

// Clk is the clock that the rotator will use to create a ticker
Clk clock.Clock

AvailabilityTarget time.Duration
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment?

// ShouldRotateX509 determines if a given SVID should be rotated, based
// on presented current time, and the certificate's expiration.
func ShouldRotateX509(now time.Time, cert *x509.Certificate) bool {
return shouldRotate(now, cert.NotBefore, cert.NotAfter)
func ShouldRotateX509(now time.Time, cert *x509.Certificate, availabilityTarget time.Duration, log logrus.FieldLogger) bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this approach, looks now functions requires to have an availabilityTarged, if it is jwt or x.509, and in both cases we requires to have a logger too.
May we create an struct? and create at the start of the agent? and set the availabilityTarget there? (and maybe add the logger too)
So places where we are using thi public function
somehting like:

type RotationUtil interface {
   X509(now time.Time, cert *x509.Certificate) bool{}
   JWTSVIDExpiresSoon(svid *client.JWTSVID, now time.Time) bool {}
}

type rotationUtil struct {
     availabilityTarget  time.Duration
     log logrus.FieldLogger
}

func (r *rotationUtil) X509(now time.Time, cert *x509.Certificate) bool{}
func (r *rotationUtil) JWTSVIDExpiresSoon(svid *client.JWTSVID, now time.Time) bool {}

but maybe it is an overkill, what do you think about this? (And I dont like interface name :D)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored some of them and would be happy to review them again.

Comment on lines 61 to 63
if availabilityTarget > 0 {
log.Warn("SVID lifetime isn't long enough to guarantee the availability_target, fell back to the default rotation strategy")
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks to me like this log must not be here, since this will happen only when logic from shouldFallbackDefaultRotationStrategy return true because the fracePreriod,
maybe you can move this log there? but that will result in spamming messages...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that the log should be done in the shouldFallbackDefaultRotationStrategy function. The current implementation already logs this every time that the SVID is being rotated. I'm not sure if I would suggest to log it only once. That could cause to remain unnoticed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is OK if the warning log for each entry to be output at every intervals?

but that will result in spamming messages...
Right, I thought to suppress it.

e.g.

WARN[0005] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0005] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0005] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0005] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0009] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0009] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0009] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0010] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0014] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0014] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0014] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0014] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0019] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0019] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0019] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0019] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0024] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0024] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0024] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0024] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0028] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0029] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0029] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0029] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0033] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0034] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0034] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0034] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0038] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0039] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0039] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0039] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0043] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0044] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0044] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0044] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0048] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/spire/agent/join_token/1e6579a0-82c8-4117-bf4c-cc6fcda4780d" subsystem_name=manager
WARN[0049] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0049] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0049] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  spiffe_id="spiffe://example.org/workload" subsystem_name=manager
...

Copy link
Collaborator

@MarcosDY MarcosDY Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont like that... so you can maybe keep the log where it is, or refactor to simplify it, I think that code can be affected for some of the another comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this approch?
When creating/renewing SVIDs, to check if the availability_target is enough or not.

INFO[0000] SVID is not found. Starting node attestation  subsystem_name=attestor trust_domain_id="spiffe://example.org"
INFO[0000] Node attestation was successful               rettestable=false spiffe_id="spiffe://example.org/spire/agent/join_token/f42c9691-21f9-4c9c-a1d0-388e3b60be4b" subsystem_name=attestor trust_domain_id="spiffe://example.org"
DEBU[0000] Entry created                                 entry=50c1d620-74b2-4bf3-887e-b6585a08b84f selectors_added=1 spiffe_id="spiffe://example.org/workload" subsystem_name=cache_manager
DEBU[0000] Entry created                                 entry=f886c54d-775f-4d21-9013-082ef8bdef6d selectors_added=1 spiffe_id="spiffe://example.org/token" subsystem_name=cache_manager
DEBU[0000] Entry created                                 entry=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 selectors_added=1 spiffe_id="spiffe://example.org/test-workload" subsystem_name=cache_manager
DEBU[0000] Renewing stale entries                        cache_type=workload count=3 limit=500 subsystem_name=manager
INFO[0000] Creating X509-SVID                            entry_id=50c1d620-74b2-4bf3-887e-b6585a08b84f spiffe_id="spiffe://example.org/workload" subsystem_name=manager
INFO[0000] Creating X509-SVID                            entry_id=f886c54d-775f-4d21-9013-082ef8bdef6d spiffe_id="spiffe://example.org/token" subsystem_name=manager
INFO[0000] Creating X509-SVID                            entry_id=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0000] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  entry_id=f886c54d-775f-4d21-9013-082ef8bdef6d spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0000] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  entry_id=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0000] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  entry_id=50c1d620-74b2-4bf3-887e-b6585a08b84f spiffe_id="spiffe://example.org/workload" subsystem_name=manager
DEBU[0000] SVID updated                                  entry=f886c54d-775f-4d21-9013-082ef8bdef6d spiffe_id="spiffe://example.org/token" subsystem_name=cache_manager
DEBU[0000] SVID updated                                  entry=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 spiffe_id="spiffe://example.org/test-workload" subsystem_name=cache_manager
DEBU[0000] SVID updated                                  entry=50c1d620-74b2-4bf3-887e-b6585a08b84f spiffe_id="spiffe://example.org/workload" subsystem_name=cache_manager
DEBU[0000] Bundle added                                  subsystem_name=svid_store_cache trust_domain_id=example.org
INFO[0000] Starting Workload and SDS APIs                address=/tmp/spire-agent/public/api.sock network=unix subsystem_name=endpoints
DEBU[0000] Initializing health checkers                  subsystem_name=health
DEBU[0812] Updating expiring SVIDs in cache              cache_type=workload expiring_svids=1 subsystem_name=manager
DEBU[0812] Renewing stale entries                        cache_type=workload count=1 limit=500 subsystem_name=manager
INFO[0812] Renewing X509-SVID                            entry_id=50c1d620-74b2-4bf3-887e-b6585a08b84f expires_at="2023-11-20T01:13:15Z" spiffe_id="spiffe://example.org/workload" subsystem_name=manager
WARN[0812] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  entry_id=50c1d620-74b2-4bf3-887e-b6585a08b84f spiffe_id="spiffe://example.org/workload" subsystem_name=manager
DEBU[0812] SVID updated                                  entry=50c1d620-74b2-4bf3-887e-b6585a08b84f spiffe_id="spiffe://example.org/workload" subsystem_name=cache_manager
DEBU[0827] Updating expiring SVIDs in cache              cache_type=workload expiring_svids=1 subsystem_name=manager
DEBU[0827] Renewing stale entries                        cache_type=workload count=1 limit=500 subsystem_name=manager
INFO[0827] Renewing X509-SVID                            entry_id=f886c54d-775f-4d21-9013-082ef8bdef6d expires_at="2023-11-20T01:13:15Z" spiffe_id="spiffe://example.org/token" subsystem_name=manager
WARN[0827] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  entry_id=f886c54d-775f-4d21-9013-082ef8bdef6d spiffe_id="spiffe://example.org/token" subsystem_name=manager
DEBU[0827] SVID updated                                  entry=f886c54d-775f-4d21-9013-082ef8bdef6d spiffe_id="spiffe://example.org/token" subsystem_name=cache_manager
DEBU[0832] Updating expiring SVIDs in cache              cache_type=workload expiring_svids=1 subsystem_name=manager
DEBU[0832] Renewing stale entries                        cache_type=workload count=1 limit=500 subsystem_name=manager
INFO[0832] Renewing X509-SVID                            entry_id=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 expires_at="2023-11-20T01:13:15Z" spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
WARN[0832] SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy  entry_id=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 spiffe_id="spiffe://example.org/test-workload" subsystem_name=manager
DEBU[0832] SVID updated                                  entry=71d139d2-69a3-4ac7-a9b9-8d482a9fcb31 spiffe_id="spiffe://example.org/test-workload" subsystem_name=cache_manager
DEBU[0850] Rotating agent SVID                           subsystem_name=manager
INFO[0850] Successfully rotated agent SVID               spiffe_id="spiffe://example.org/spire/agent/join_token/f42c9691-21f9-4c9c-a1d0-388e3b60be4b" subsystem_name=manager


return ttl <= jitteredHalfLife
return ttl <= availabilityTarget
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think this may have a jitter, this can even simplify your code here too, because you can get halftime or your availabilityTarget as initial step for the shitter.
if you are worried about 10%, you can reduce that amount when availabilityTarget is set,
what value do you think is good enough?

temp, err := tc.makeTemplate()
require.NoError(t, err)

cert, _, err := util.SelfSign(temp)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: signing a x.509 certificate looks like an overkill.. we only need cert.NotBefore & cert.NotAfter...
maybe you can create a x509.Certificate struct with those values set?

"github.com/spiffe/spire/pkg/agent/client"
)

const (
gracePeriodThreshold = 12 * time.Hour
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: may we have the same gracePeriodThreshold for x509 and jwt? it looks to me like jwt ones must live less time,
and I'm not sure if we even want to have this for those SVIDs... and i case we have it 12hs looks like a lot of time for a minimun
what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the consideration about JWT-SVIDs is a valid point.
The default JWT-SVID TTL is 5m, and users in general will want a much shorter JWT-SVID TTL than the X509-SVID TTL.
@hiyosi What's the JWT-SVID TTL in your use case? Or was it based mainly on requirements for X509-SVID?
Knowing more about how JWT-SVIDs are issued and consumed in your use case would help to have a better idea about what's the best approach for this in regards to JWT-SVIDs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, I proposed a feature that would only affect X509.
Then, based on the discussion in the proposal, I implemented a configurable that affects both X509 and JWT.
But it may not be the final conclusion.

#4115 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hiyosi We discussed this again and we feel that it would be kind of weird if this configurable affects JWT-SVIDs also. We think that JWT-SVIDs should be excluded, and have that documented properly. There is no need to make a change in the configuration name, but the implementation should be updated to not affect JWT-SVIDs. The documentation should then be clear that this affects agent SVIDs and workload X509-SVIDs, but not JWT-SVIDs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reply.

We think that JWT-SVIDs should be excluded,

I understand. No problem for me either.

Comment on lines 61 to 63
if availabilityTarget > 0 {
log.Warn("SVID lifetime isn't long enough to guarantee the availability_target, fell back to the default rotation strategy")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that the log should be done in the shouldFallbackDefaultRotationStrategy function. The current implementation already logs this every time that the SVID is being rotated. I'm not sure if I would suggest to log it only once. That could cause to remain unnoticed.


return ttl <= jitteredHalfLife
return ttl <= availabilityTarget
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that we should preserve the jitter (with a reduced percentage in this case).

| `trust_bundle_format` | Format of the initial trust bundle, pem or spiffe | pem |
| `trust_domain` | The trust domain that this agent belongs to (should be no more than 255 characters) | |
| `workload_x509_svid_key_type` | The workload X509 SVID key type &lt;rsa-2048&vert;ec-p256&gt; | ec-p256 |
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively SVIDs should be rotated. Must be greater than 24h. | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively SVIDs should be rotated. Must be greater than 24h. | |
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively SVIDs should be rotated. If set, must be greater than 24h. | |


If the `availability_target` is set, the agent will rotate an SVID when its remaining lifetime reaches the `availability_target`.

To guarantee the `availability_target`, grace period(`SVID lifetime - availability_target`) must be greater than 12h.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To guarantee the `availability_target`, grace period(`SVID lifetime - availability_target`) must be greater than 12h.
To guarantee the `availability_target`, grace period (`SVID lifetime - availability_target`) must be greater than 12h.

@hiyosi hiyosi force-pushed the configurable-svid-rotation branch from 4f06c55 to e39a464 Compare November 20, 2023 02:23
@hiyosi hiyosi force-pushed the configurable-svid-rotation branch 4 times, most recently from 3072757 to 349024b Compare November 29, 2023 05:37
@hiyosi
Copy link
Contributor Author

hiyosi commented Nov 29, 2023

I addressed comments.

Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @hiyosi for your patience. I think that we are pretty close here.
I left a few more comments.

| `trust_bundle_format` | Format of the initial trust bundle, pem or spiffe | pem |
| `trust_domain` | The trust domain that this agent belongs to (should be no more than 255 characters) | |
| `workload_x509_svid_key_type` | The workload X509 SVID key type &lt;rsa-2048&vert;ec-p256&gt; | ec-p256 |
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively X509 SVIDs should be rotated. If set, must be greater than 24h. | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it would be good to reference and add a link to the Availability Target section here.

@@ -105,6 +106,15 @@ These are the available profiles that can be set in the `profiling_freq` configu
- `trace`
- `cpu`

### Availability Target

NOTE: The `availability_target` only affects the agent SVIDs and workload X509-SVIDs, but not JWT-SVIDs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
NOTE: The `availability_target` only affects the agent SVIDs and workload X509-SVIDs, but not JWT-SVIDs.
_Note: The `availability_target` only affects the agent SVIDs and workload X509-SVIDs, but not JWT-SVIDs._

| `trust_bundle_format` | Format of the initial trust bundle, pem or spiffe | pem |
| `trust_domain` | The trust domain that this agent belongs to (should be no more than 255 characters) | |
| `workload_x509_svid_key_type` | The workload X509 SVID key type &lt;rsa-2048&vert;ec-p256&gt; | ec-p256 |
| `availability_target` | The minimum amount of time desired to gracefully handle SPIRE Server or Agent downtime. This configurable influences how aggressively X509 SVIDs should be rotated. If set, must be greater than 24h. | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the logic is that the availability target must be at least 24, which allows setting a target of 24h. It should say that must be at least 24h instead of greater than 24h.


If the `availability_target` is set, the agent will rotate an X509 SVID when its remaining lifetime reaches the `availability_target`.

To guarantee the `availability_target`, grace period (`SVID lifetime - availability_target`) must be greater than 12h.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To guarantee the `availability_target`, grace period (`SVID lifetime - availability_target`) must be greater than 12h.
To guarantee the `availability_target`, grace period (`SVID lifetime - availability_target`) must be at least 12h.

func shouldFallbackX509Default(lifetime, availabilityTarget time.Duration) bool {
// if the grace period less than the threshold, it should be felt back to the default rotation strategy
gracePeriod := lifetime - availabilityTarget
return gracePeriod < gracePeriodThreshold
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's easier to understand if we have a grace period of at least x hours.

Suggested change
return gracePeriod < gracePeriodThreshold
return gracePeriod <= gracePeriodThreshold

return nil, fmt.Errorf("unable to parse availability_target: %w", err)
}
if t < minimumAvailabilityTarget {
return nil, fmt.Errorf("availability_target must be greater than %s", minimumAvailabilityTarget.String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("availability_target must be greater than %s", minimumAvailabilityTarget.String())
return nil, fmt.Errorf("availability_target must be at least %s", minimumAvailabilityTarget.String())

"spiffe_id": chain[0].URIs[0].String(),
"entry_id": entryID,
})
log.Warn("X509 SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Warn("X509 SVID lifetime isn't long enough to guarantee the availability_target, fall back to the default rotation strategy")
log.Warn("X509 SVID lifetime isn't long enough to guarantee the availability_target, falling back to the default rotation strategy")

@hiyosi hiyosi force-pushed the configurable-svid-rotation branch from 349024b to a8a0228 Compare December 12, 2023 23:30
@hiyosi
Copy link
Contributor Author

hiyosi commented Dec 14, 2023

@amartinezfayo Thank you for your review.
I updated the branch.

amartinezfayo
amartinezfayo previously approved these changes Dec 14, 2023
Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @hiyosi!

@amartinezfayo
Copy link
Member

@hiyosi There is a conflict due to a recent merged change in the main branch. Could you please resolve it so we can have this merged? Thanks!

Signed-off-by: Tomoya Usami <tousami@zlab.co.jp>
@hiyosi hiyosi force-pushed the configurable-svid-rotation branch from 4117352 to c2885c5 Compare December 14, 2023 23:05
@hiyosi
Copy link
Contributor Author

hiyosi commented Dec 14, 2023

I resolved the conflict.
One of the unit-tests (macos-latest) was failed on CI but I can pass all tests in my local environment.

Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @hiyosi for fixing the conflicts.
We squash the commits before merging, so there is no need to force push a single commit. Force pushing to a branch that is being reviewed makes the review more difficult, so we try to avoid that when is possible :)

Thanks again for this contribution, is greatly appreciated!

@amartinezfayo amartinezfayo merged commit 26b9a49 into spiffe:main Dec 15, 2023
32 checks passed
@hiyosi
Copy link
Contributor Author

hiyosi commented Dec 15, 2023

We squash the commits before merging, so there is no need to force push a single commit. Force pushing to a branch that is being reviewed makes the review more difficult, so we try to avoid that when is possible :)

I understand. I will respect that policy.

@hiyosi hiyosi deleted the configurable-svid-rotation branch December 15, 2023 02:17
@azdagron azdagron added this to the 1.8.7 milestone Dec 15, 2023
rushi47 pushed a commit to rushi47/spire that referenced this pull request Apr 11, 2024
Signed-off-by: Tomoya Usami <tousami@zlab.co.jp>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: Configurable SVID rotation threshold on Agents
5 participants