🐛 Only refresh bootstrap token if needed, requeue in all cases where node hasn't joined yet #9229
Conversation
Hi @AndiDog. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
```go
return ctrl.Result{}, errors.Wrapf(err, "can't parse expiration time of bootstrap token")
}

if expiration.After(time.Now().UTC().Add(r.TokenTTL * 5 / 6)) {
```
Refresh if 1/6th of the TTL has passed. We can still change this, but since `RequeueAfter` is now consistently `r.TokenTTL / 3` (which I consider a sane value even for intermittent errors), the next regular reconciliation would refresh.
Let's add a code comment here that explains this.

Since this value needs to be tightly coupled with `tokenCheckRefreshOrRotationInterval`, we might want to either define them in the same place in the code or make them depend on each other so we don't inadvertently change one but not the other in the future.
Added a helper function so this is found next to `func tokenCheckRefreshOrRotationInterval`, and added comments on how the two are related.
```diff
 return ctrl.Result{}, errors.Wrapf(err, "failed to refresh bootstrap token")
 }
 return ctrl.Result{
-	RequeueAfter: r.TokenTTL / 2,
+	RequeueAfter: r.TokenTTL / 3,
```
Consistency. Also, if we only reconcile twice within the TTL, the first reconciliation at halftime (`TTL / 2`) must not fail. Better to give more chances in case of intermittent errors.
```diff
@@ -554,6 +584,9 @@ func (r *KubeadmConfigReconciler) joinWorker(ctx context.Context, scope *Scope)
 if res, err := r.reconcileDiscovery(ctx, scope.Cluster, scope.Config, certificates); err != nil {
 	return ctrl.Result{}, err
 } else if !res.IsZero() {
+	if res.RequeueAfter == 0 {
```
This and the code-line change below address cases where, IMO, the reconciler returned that no requeueing is needed, but we do need it while nodes haven't joined yet, since the token may expire. It's covered in a changed test, so please double-check that it makes sense.
I don't think this change is necessary. `reconcileDiscovery` takes care of creating the token; it is not linked to the token refresh process.
It creates the bootstrap token (via the function `createToken`) and sets the expiry like this:

```go
bootstrapapi.BootstrapTokenExpirationKey: []byte(time.Now().UTC().Add(ttl).Format(time.RFC3339)),
```

So as soon as that has happened, we need to refresh periodically. But since we check for `!res.IsZero()` (= "does the result requeue?"), I also think this change doesn't really make sense at this location. Removing it.
```diff
@@ -634,7 +667,7 @@ func (r *KubeadmConfigReconciler) joinWorker(ctx context.Context, scope *Scope)
 	scope.Error(err, "Failed to store bootstrap data")
 	return ctrl.Result{}, err
 }
-return ctrl.Result{}, nil
+return ctrl.Result{RequeueAfter: r.TokenTTL / 3}, nil
```
If we are changing this here, please add a comment explaining why we are requeueing, e.g.:

```go
// joinWorker completed, ensuring to reconcile this object again so we keep refreshing the bootstrap token until it is consumed
```

Also, the same change should be applied to `joinControlPlane`.
Both done 👌
```diff
@@ -739,7 +771,8 @@ func (r *KubeadmConfigReconciler) joinControlplane(ctx context.Context, scope *S
 	return ctrl.Result{}, err
 }

-return ctrl.Result{}, nil
+// Ensure reconciling this object again so we keep refreshing the bootstrap token until it is consumed
+return ctrl.Result{RequeueAfter: r.TokenTTL / 3}, nil
```
nit: given this is using `r.TokenTTL / 3` in multiple places, I'd put it in a var, e.g. `tokenRequeueAfter`, that is used everywhere so there's no chance of skew in future changes.
done
This API diff was detected as a test error:

What do I need to do here? I changed the function signature on purpose.
@enxebre @fabriziopandini The requested changes are included. We found in chat that fixing the apidiff test error is not mandatory, so I think we should go ahead, since only a test function signature was changed. If that change makes sense, please have a look; I think this PR is ready now.
Changes look good to me, but it would be great if someone else gave this a look.

/test pull-cluster-api-e2e-full-main
@enxebre Would you like to have a closer look?

@enxebre @JoelSpeed could you review this one? Thank you.
Security people might not agree with keeping bootstrap tokens around for longer periods of time; if CAPI keeps them to a maximum of 24h, sgtm.
lgtm overall, a few minor comments
```go
if expiration.After(time.Now().UTC().Add(r.TokenTTL * 5 / 6)) {
	// We still have lots of time until it expires
	log.Info("Token needs no refresh")
```
Would it make sense to log how much time is left before expiration, maybe at a higher verbosity for debugging purposes?
Changed to `V(2)` and added a `tokenExpiresInSeconds` log field.
Let's add a code comment here that explains this.

Since this value needs to be tightly coupled with `tokenCheckRefreshOrRotationInterval`, we might want to either define them in the same place in the code or make them depend on each other so we don't inadvertently change one but not the other in the future.
We have refresh (token does not expire while the node hasn't joined yet) vs. rotation (machine pools only: keep creating a new token so that new nodes can start up at any time). There's no time limit for the refresh case, though. If we think this is problematic, we should tackle it in a separate issue. My PR only fixes the constant refreshing (= changing
I addressed the review comments and also tested the change manually.
/lgtm
/assign @fabriziopandini @enxebre
LGTM label has been added. Git tree hash: b8685357062becbf5762a0b9443cf49999eb7c98
```go
secretExpiration := bootstrapsecretutil.GetData(secret, bootstrapapi.BootstrapTokenExpirationKey)
if secretExpiration == "" {
	log.Info(fmt.Sprintf("Token has no valid value for %s, writing new expiration timestamp", bootstrapapi.BootstrapTokenExpirationKey))
```
In which case would the expiration be empty?
It shouldn't be, as of the CAPI code. However, external operators and humans interact with Kubernetes as well, so this is regular error-handling code for the merely theoretical possibility of this happening.
```go
now := time.Now().UTC()
skipTokenRefreshIfExpiringAfter := now.Add(r.skipTokenRefreshIfExpiringAfter())
if expiration.After(skipTokenRefreshIfExpiringAfter) {
	// We still have lots of time until it expires
```
This comment says we have "lots of time", although the check above only checks that the expiration is within the window
"lots" seems misleading indeed. It's rather "enough time as of the call to `skipTokenRefreshIfExpiringAfter()`". The log message provides enough explanation here, so I removed the comment.
```go
// skipTokenRefreshIfExpiringAfter returns a duration from "now". If the token's expiry timestamp is after
// `now + skipTokenRefreshIfExpiringAfter()`, it does not yet need a refresh.
func (r *KubeadmConfigReconciler) skipTokenRefreshIfExpiringAfter() time.Duration {
```
Was the name/comment a leftover? The function just returns a duration, rather than a duration from now.
Fixed
@vincepri Please have a look. I've pushed merge + changes so you can still see in GitHub what exactly has changed (hopefully). I can squash to a single commit once all looks okay.
ok good to squash for me! |
Force-pushed from 563213a to 80cc163.
LGTM
@sbueringer do you want to give it a pass?
Yup, would be good. I hope I get around to it in the next few days.
Force-pushed from b620426 to 9529aea.
@sbueringer I applied both suggestions.
@AndiDog: The following test failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Thx!

/lgtm

/assign @fabriziopandini
LGTM label has been added. Git tree hash: 814237ec9de63ed8a38fa634944394d546e63898
/test pull-cluster-api-e2e-main |
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
What this PR does / why we need it:
Right now, the bootstrap token is refreshed on every reconciliation until nodes have joined. This creates unnecessary etcd writes. Only extend the expiry timestamp (called "refresh" in code) if we're getting close to reaching it. Particularly if KubeadmConfig reconciliation is triggered by sources other than CAPI's `RequeueAfter` interval, this can pile up to quite noisy logs and API accesses. Along the way, I did some other improvements and will add inline comments to explain.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when the PR gets merged): This was a drive-by along the way of fixing giantswarm/roadmap#2217, but isn't really related.
/area provider/bootstrap-kubeadm
/area testing