You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Recently, issues with using SSA have been observed when it comes to reverting resources that are managed by controller A, but where another actor B adds fields to the resource that controller A isn't aware of and that should be reverted by controller A. For example:
another actor configures a volume for this deployment; lifecycle-manager isn't aware of this added volume (it is not part of the deployment specified in the module's raw manfiest)
The goal of this ADR is to align a uniform approach within Kyma that enables our controllers to revert undesired additions to Kyma-managed objects.
Expectations
The following cases summarize the expected scenarios where Kyma controllers are either expected to revert changes or to keep changes.
Expectation 1 - Revert resources to their Kyma-managed state
In general, it is expected that Kyma controllers revert additions to the resources they manage. This is also expressed by the disclaimer "DO NOT EDIT - This resource is managed by Kyma. Any modifications are discarded and the resource is reverted to the original state." present on resources managed by lifecycle-manager (currently not used uniformly). Any additions to those resources by the customer must be reverted. E.g.,
resources from the module's raw manifest
internal resources like SKR webhook-related ones
However, there are exceptions where Kyma provides the initial state of a resource, and it is expected that the customer extends this. In those cases, the changes must not be reverted. E.g.,
KymaCR where the customer modifies the modules
default CRs where the customer may change configuration
Expectation 2 - Don't revert additions by K8s
For some resources, K8s itself adds information to the resources. This is mostly limited to .metadata and .state. Those changes must not be reverted. E.g.,
kube-controller-manager extends .metadata.annotations and .status of a Deployment
Expectation 3 - Revert additions by SRE unless skip-reconcilation=true
SRE or other Kyma-internal actors may need to modify resources while troubleshooting. Those changes are also reverted unless the operator.kyma-project.io/skip-reconciliation=true label is present on the resource.
Expectation 4 - Don't revert modifications from other Kyma controllers
Some resources may be managed by multiple Kyma-internal controllers. It must be ensured that those don't end up in an endless "add-revert loop". E.g.,
Namespace kyma-system is controlled by lifecycle-manager, but environment-broker adds labels
CustomResourceDefinitions (CRDs) are controlled by lifecycle-manager, but module managers may add a conversion webhook
Expectation 5 - Revert resources when managing a previously unmanaged module
When re-managing a previously unmanaged module (see feature request), the Kyma controller takes care of a resource initially provided by the customer. In this case, all extensions by the customer not part of the original config shall be reverted.
Key Assumptions
Key Assumption 1 - Managers don't give up ownership properly
When working with SSA, the clean approach would be to give up ownership. For fields that should remain to exist, this would be done by transferring ownership to another manager. For fields that were added and are not necessary anymore, this should be done by sending another SSA without the field included. Experience from the tickets linked in Context shows that this is not always the case. Also, this is not the case when customers edit the resources manually.
Option 1 - Recreating the object
The situation could be solved by re-creating the object. The problem is however that this may cause temporary disruptions until the object is re-created and all expected managers have added their patches. Therefore, this is considered as a non-option.
Option 2 - Reverting added fields with SSA
💡 It is recommended to first get a rough understanding of how SSA and especially the mechanism of "Field management" works before reading further. See Server-Side Apply in K8s docs.
In summary, SSA allows the caller to modify a resource by providing the subset of fields the caller cares for. The K8s runtime takes care of either creating a new resource if it doesn't exist yet, or patching an existing resource with the provided fields while maintaining the remaining state. To keep track of who is controlling what part of the single resource, K8s stores the caller's identity as "field manager" together with the fields this manager controls in .metadata.managedFields. If multiple managers try to modify the same field, this results in a conflict and the update is rejected. For controllers, it is recommended that they always force conflicts on objects they own meaning that the controller overwrites the other managers value.
Due to the way how SSA is designed to work, a controller A using SSA to reconcile a resource is not able to revert an addition by actor B that controller A is not aware of without further consideration. In the following, a possible way of enabling our controllers to revert such additions is outlined.
Approach - Detect unexpected managers, remove them, and take full ownership of the resource
This option consists of three steps that are executed consecutively, and a fourth step which is the regular reconcile of the resource:
the controller detects that there are unexpected field managers present on the resource
the controller removes the unexpected field managers from the resource
the controller re-applies the full resource including the unknown fields to claim ownership of those fields
the controller re-applies its desired state of the object without the unknown fields (regular reconcile); this removes those fields again as the controller previously claimed their ownership and the consecutive call without those fields is interpreted as desire to remove those fields
In pseudo code, this may look as follows:
funcssaObject(obj info.Object) error {
// 0. regularly reconcile the obj, the obj argument is the desired state as retrieved from the module's raw manifest// after this call, obj is filled with the full state as currently in the clusterclnt.Patch(ctx, obj, client.Apply, client.ForceOwnership, OWNER_NAME)
managedFields:=obj.GetManagedFields()
needToTakeOwnership:=falsefori:=0; i<len(managedFields); i++ {
// 1. detect unexpected field managerifisUnexpectedFieldManager(managedFields[i].Manager) {
// 2. remove unexpected field manager from managedFieldspatch:= []byte(fmt.Sprintf(`[{"op": "remove", "path": "/metadata/managedFields/%d"}]`, i))
clnt.Patch(ctx, obj, client.RawPatch(types.JSONPatchType, patch))
needToTakeOwnership=true
}
}
// 3. if there was an unexpected field manager, take ownership by re-applying the full obj// as we don't want to change fields, just claim ownership, we also can omit the client.ForceOwnershipifneedToTakeOwnership {
clnt.Patch(ctx, obj, client.Apply, OWNER_NAME)
}
// 4. exit, the next regular recocnile will remove the unexpected fieldsreturnnil
}
In general, this approach works. It has been tested successfully with both, actor B using SSA and client-side apply. It does NOT work when controller A is using client-side-apply! However, it comes with the following challenges:
Challenge 1 - Detecting what field managers are unexpected
The controller needs to have information what field managers are expected per resource and which ones are not. Maintaining an allow-list per object appears too cumbersome.
As an alternative, it would be possible to create a convention that all controllers in Kyma MUST use a pattern for their field manager name like operator.kyma-project.io/<manager-name>. By this, all field managers not complying to this pattern could be filtered out as unexpected ones. Additionally, this would require SRE and any other Kyma-internal actor making changes to the object via kubectl to specify a field manager with the aligned prefix (kubectl flag --field-manager).
Also, the K8s internal managers like kube-controller-manager MUST be allow-listed.
Challenge 2 - Temporarily taking ownership from other expected managers
When re-sending the full object (step 3.), the controller also claims ownership of the fields owned by another expected controller. This should not be a problem since the controller gives up ownership again with the regular reconcile that doesn't include the other field managers fields (step 4.).
Challenge 3 - Automatic or on-demand action
It needs to be decided whether this reverting approach is executed automatically, or if it is executed on demand. On-demand would mean to actively trigger this approach to be executed. This could be done by adding an annotation to the resource. With this approach, it could also be decided whether the revert is executed on all unexpected field managers as described above, or if only a field manager named in the annotation value is removed.
Challenge 4 - Giving up ownership
If decided to follow an approach where we detect known field managers based on a name prefix, Kyma controller A does not revert additions from Kyma controller B. Consequently, it would not automatically solve the problem described in #941.
The proper approach for this case would be that the module-manager (controller B) still gives up ownership properly by sending an SSA with an empty object.
Option 3 - Not reverting added fields
Additions to the resource are not reverted. Instead, additions made by the user are ignored. If they lead to errors in the cluster, they are considered user errors.
When it comes to multiple Kyma controllers handling the same resource, the controllers must ensure that they properly give up ownership of patches they added when not needed anymore. I.e, they must send an SSA with an empty object. This will remove their additions to the object and the field manager entry.
Option 4 - Prevent updates with a validation webhook
Instead of reverting changes done by unkown actors, they could also be prevented in the first place.
The idea is to configure validating webhoooks that intercept all changes on kyma-managed resources.
If an unkown actor is trying to change the resource, the validating webhook should fail.
Challenge 1 - Detecting what resources to validate
It must be determined what resources to validate. This may be achieved by namespaceSelector and objectSelector (see ValidatingWebhookConfiguration) filtering all objects labelled with operator.kyma-project.io/managed-by: kyma.
In the webhook, we would need to be able to configure exceptions for KymaCR and DefaultCRs.
Challenge 2 - Detecting what actors are unexpected.
This could be solved in the same way as proposed in Option 2. AdmissionRequests contain an options field listing the FieldManager, e.g., see CreateOptions.
Challenge 3 - Handling the load
The validation webhook may need to handle a lot of load and should not add much latency to the change requests.
A central webhook in KCP managing all clusters may be facing too much load.
Instead, there should be one webhook per SKR where lifecycle manager ensures the webhook is present in the SKR.
Challenge 4 - Rotating the certificates
Webhooks must secure communication via certificates that need to be rotated.
This may be achieved by re-using the existing certificates for the SKR webhook that lifecycle manager ensures in the SKR.
Challenge 5 - Single Point of Failure
If this webhook breaks down for whatever reason, this would prevent all changes to Kyma-managed resources in the SKR.
A mitigation is to use failurePolicy: ignore. This will make the admission request succeed on unknown errors from the webhook. The downside is that this allows unexpected changes while the webhook is down or unavailable, e.g. when the cluster starts. Those changes are not reverted in the future but stick.
Decisions
It is decided to pursue Option 2 - Reverting added fields with SSA. Kyma controllers shall revert changes from unexpected managers via the approach outlined. Is is explicitly accepted that the approach for identifying expected managers by the naming pattern operator.kyma-project.io/<manager-name> is not providing genuine authenticity.
Major arguments against Option 4 - Prevent updates with a validation webhook are the added latency to each request, and the more native behavior of Option 2 (we reconcile to a desired state).
In a first step, lifecycle-manager will implement the detection of unexpected managers (PoC). It will not revert those changes but write observability data to get an understanding of the changes that would be reverted. Once verified that this matches the expectations, the approach will be rolled out to revert these changes.
Consequences
Jellyfish (PoC):
implement the detection of unexpected managers (first with an allow-list, later with the naming pattern)
write observability data of what changes would be reverted
Jellyfish (once approach is confirmed by PoC):
rename their field-managers to follow the pattern operator.kyma-project.io/<manager-name>
implement the reverting logic
check whether the pattern could be implemented as a re-use library
Kyma-teams with operators writing SKR resources (once approach is confirmed by PoC):
rename their field-managers to follow the pattern operator.kyma-project.io/<manager-name>
implement the reverting logic where applicable
give up ownership properly where needed (see Option 2, Challenge 4)
Addendum
Notes
client-side applying an existing object with a new field creates a field manager entry for this new field but DOES NOT take ownership of the existing field
in contrast, SSA DOES take ownership of the existing fields
The text was updated successfully, but these errors were encountered:
Created on 2024-11-29 by Christoph Schwägerl (@c-pius).
Decision log
Re-proposed on 2025-01-08
(v0.0.2)Changelog
v0.0.1
v0.0.2
skip-reconcilation=true
; this leads to a new challengeContext
Controllers within Kyma often use Server-Side Apply (SSA) for reconciling the resources they manage.
Recently, issues with using SSA have been observed when it comes to reverting resources that are managed by controller A, but where another actor B adds fields to the resource that controller A isn't aware of and that should be reverted by controller A. For example:
Real examples where this has been problematic:
The goal of this ADR is to align a uniform approach within Kyma that enables our controllers to revert undesired additions to Kyma-managed objects.
Expectations
The following cases summarize the expected scenarios where Kyma controllers are either expected to revert changes or to keep changes.
Expectation 1 - Revert resources to their Kyma-managed state
In general, it is expected that Kyma controllers revert additions to the resources they manage. This is also expressed by the disclaimer "DO NOT EDIT - This resource is managed by Kyma. Any modifications are discarded and the resource is reverted to the original state." present on resources managed by lifecycle-manager (currently not used uniformly). Any additions to those resources by the customer must be reverted. E.g.,
However, there are exceptions where Kyma provides the initial state of a resource, and it is expected that the customer extends this. In those cases, the changes must not be reverted. E.g.,
Expectation 2 - Don't revert additions by K8s
For some resources, K8s itself adds information to the resources. This is mostly limited to
.metadata
and.state
. Those changes must not be reverted. E.g.,kube-controller-manager
extends.metadata.annotations
and.status
of a DeploymentExpectation 3 - Revert additions by SRE unless
skip-reconcilation=true
SRE or other Kyma-internal actors may need to modify resources while troubleshooting. Those changes are also reverted unless the
operator.kyma-project.io/skip-reconciliation=true
label is present on the resource.Expectation 4 - Don't revert modifications from other Kyma controllers
Some resources may be managed by multiple Kyma-internal controllers. It must be ensured that those don't end up in an endless "add-revert loop". E.g.,
kyma-system
is controlled by lifecycle-manager, but environment-broker adds labelsExpectation 5 - Revert resources when managing a previously unmanaged module
When re-managing a previously unmanaged module (see feature request), the Kyma controller takes care of a resource initially provided by the customer. In this case, all extensions by the customer not part of the original config shall be reverted.
Key Assumptions
Key Assumption 1 - Managers don't give up ownership properly
When working with SSA, the clean approach would be to give up ownership. For fields that should remain to exist, this would be done by transferring ownership to another manager. For fields that were added and are not necessary anymore, this should be done by sending another SSA without the field included. Experience from the tickets linked in Context shows that this is not always the case. Also, this is not the case when customers edit the resources manually.
Option 1 - Recreating the object
The situation could be solved by re-creating the object. The problem is however that this may cause temporary disruptions until the object is re-created and all expected managers have added their patches. Therefore, this is considered as a non-option.
Option 2 - Reverting added fields with SSA
In summary, SSA allows the caller to modify a resource by providing the subset of fields the caller cares for. The K8s runtime takes care of either creating a new resource if it doesn't exist yet, or patching an existing resource with the provided fields while maintaining the remaining state. To keep track of who is controlling what part of the single resource, K8s stores the caller's identity as "field manager" together with the fields this manager controls in
.metadata.managedFields
. If multiple managers try to modify the same field, this results in a conflict and the update is rejected. For controllers, it is recommended that they always force conflicts on objects they own meaning that the controller overwrites the other managers value.Due to the way how SSA is designed to work, a controller A using SSA to reconcile a resource is not able to revert an addition by actor B that controller A is not aware of without further consideration. In the following, a possible way of enabling our controllers to revert such additions is outlined.
Approach - Detect unexpected managers, remove them, and take full ownership of the resource
This option consists of three steps that are executed consecutively, and a fourth step which is the regular reconcile of the resource:
In pseudo code, this may look as follows:
In general, this approach works. It has been tested successfully with both, actor B using SSA and client-side apply. It does NOT work when controller A is using client-side-apply! However, it comes with the following challenges:
Challenge 1 - Detecting what field managers are unexpected
The controller needs to have information what field managers are expected per resource and which ones are not. Maintaining an allow-list per object appears too cumbersome.
As an alternative, it would be possible to create a convention that all controllers in Kyma MUST use a pattern for their field manager name like
operator.kyma-project.io/<manager-name>
. By this, all field managers not complying to this pattern could be filtered out as unexpected ones. Additionally, this would require SRE and any other Kyma-internal actor making changes to the object viakubectl
to specify a field manager with the aligned prefix (kubectl
flag--field-manager
).Also, the K8s internal managers like
kube-controller-manager
MUST be allow-listed.Challenge 2 - Temporarily taking ownership from other expected managers
When re-sending the full object (step 3.), the controller also claims ownership of the fields owned by another expected controller. This should not be a problem since the controller gives up ownership again with the regular reconcile that doesn't include the other field managers fields (step 4.).
Challenge 3 - Automatic or on-demand action
It needs to be decided whether this reverting approach is executed automatically, or if it is executed on demand. On-demand would mean to actively trigger this approach to be executed. This could be done by adding an annotation to the resource. With this approach, it could also be decided whether the revert is executed on all unexpected field managers as described above, or if only a field manager named in the annotation value is removed.
Challenge 4 - Giving up ownership
If decided to follow an approach where we detect known field managers based on a name prefix, Kyma controller A does not revert additions from Kyma controller B. Consequently, it would not automatically solve the problem described in #941.
The proper approach for this case would be that the module-manager (controller B) still gives up ownership properly by sending an SSA with an empty object.
Option 3 - Not reverting added fields
Additions to the resource are not reverted. Instead, additions made by the user are ignored. If they lead to errors in the cluster, they are considered user errors.
When it comes to multiple Kyma controllers handling the same resource, the controllers must ensure that they properly give up ownership of patches they added when not needed anymore. I.e, they must send an SSA with an empty object. This will remove their additions to the object and the field manager entry.
Option 4 - Prevent updates with a validation webhook
Instead of reverting changes done by unkown actors, they could also be prevented in the first place.
The idea is to configure validating webhoooks that intercept all changes on kyma-managed resources.
If an unkown actor is trying to change the resource, the validating webhook should fail.
Challenge 1 - Detecting what resources to validate
It must be determined what resources to validate. This may be achieved by
namespaceSelector
andobjectSelector
(see ValidatingWebhookConfiguration) filtering all objects labelled withoperator.kyma-project.io/managed-by: kyma
.In the webhook, we would need to be able to configure exceptions for KymaCR and DefaultCRs.
Challenge 2 - Detecting what actors are unexpected.
This could be solved in the same way as proposed in Option 2. AdmissionRequests contain an
options
field listing the FieldManager, e.g., see CreateOptions.Challenge 3 - Handling the load
The validation webhook may need to handle a lot of load and should not add much latency to the change requests.
A central webhook in KCP managing all clusters may be facing too much load.
Instead, there should be one webhook per SKR where lifecycle manager ensures the webhook is present in the SKR.
Challenge 4 - Rotating the certificates
Webhooks must secure communication via certificates that need to be rotated.
This may be achieved by re-using the existing certificates for the SKR webhook that lifecycle manager ensures in the SKR.
Challenge 5 - Single Point of Failure
If this webhook breaks down for whatever reason, this would prevent all changes to Kyma-managed resources in the SKR.
A mitigation is to use
failurePolicy: ignore
. This will make the admission request succeed on unknown errors from the webhook. The downside is that this allows unexpected changes while the webhook is down or unavailable, e.g. when the cluster starts. Those changes are not reverted in the future but stick.Decisions
It is decided to pursue Option 2 - Reverting added fields with SSA. Kyma controllers shall revert changes from unexpected managers via the approach outlined. Is is explicitly accepted that the approach for identifying expected managers by the naming pattern
operator.kyma-project.io/<manager-name>
is not providing genuine authenticity.Major arguments against Option 4 - Prevent updates with a validation webhook are the added latency to each request, and the more native behavior of Option 2 (we reconcile to a desired state).
In a first step, lifecycle-manager will implement the detection of unexpected managers (PoC). It will not revert those changes but write observability data to get an understanding of the changes that would be reverted. Once verified that this matches the expectations, the approach will be rolled out to revert these changes.
Consequences
operator.kyma-project.io/<manager-name>
operator.kyma-project.io/<manager-name>
Addendum
Notes
The text was updated successfully, but these errors were encountered: