
Fix #8132: managed fields failure during Restore #8133

Open

mpryc wants to merge 1 commit into main

Conversation

@mpryc commented Aug 20, 2024

Thank you for contributing to Velero!

Please add a summary of your change

This commit addresses issue #8132, where an error randomly appears in the logs during the restore operation.

The error occurs due to a race condition when attempting to patch managed fields on an object that has been modified in the cluster. The error message indicates that the operation cannot be fulfilled because the object has been modified, suggesting that changes should be applied to the latest version.

To resolve this, a retry mechanism has been implemented in the restore process when encountering this error, ensuring that managed fields are properly restored without the error message appearing in the logs.
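
For illustration, a minimal sketch of what such a retry could look like, built on client-go's conflict-retry helper. The names resourceClient, createdObj, obj, and generatePatch come from the restore code discussed later in this thread; the re-read of the live object inside the retry loop is an assumption of this sketch, not necessarily the exact diff:

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"
)

// Sketch only: retry the managed-fields patch when the API server reports a
// conflict (HTTP 409), recomputing the patch against the latest object each time.
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	// Assumes the resource client exposes a Get; fetch the current version so
	// the patch is computed against what is actually in the cluster.
	latest, getErr := resourceClient.Get(createdObj.GetName(), metav1.GetOptions{})
	if getErr != nil {
		return getErr
	}
	withoutManagedFields := latest.DeepCopy()
	latest.SetManagedFields(obj.GetManagedFields())
	patchBytes, patchErr := generatePatch(withoutManagedFields, latest)
	if patchErr != nil {
		return patchErr
	}
	_, patchErr = resourceClient.Patch(latest.GetName(), patchBytes)
	return patchErr
})

retry.DefaultRetry is a handful of quick attempts (about 10ms apart), which suits a transient write conflict rather than an API server outage.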

Does your change fix a particular issue?

Fixes #8132

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.
    • not needed

@mpryc (Author) commented Aug 20, 2024

/kind changelog-not-required

@github-actions bot added the kind/changelog-not-required label (PR does not require a user changelog; often for docs, website, or build changes) on Aug 20, 2024
@anshulahuja98 (Collaborator) commented

I would prefer if this is backported to 1.13 as well

@blackpiglet (Contributor) commented Aug 20, 2024

@mpryc
I suggest adding a changelog file for this PR.

codecov bot commented Aug 20, 2024

Codecov Report

Attention: Patch coverage is 37.03704% with 17 lines in your changes missing coverage. Please review.

Project coverage is 59.07%. Comparing base (86963bf) to head (b091d49).
Report is 14 commits behind head on main.

Files Patch % Lines
pkg/restore/restore.go 37.03% 13 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8133      +/-   ##
==========================================
+ Coverage   58.99%   59.07%   +0.08%     
==========================================
  Files         364      364              
  Lines       30270    30310      +40     
==========================================
+ Hits        17858    17906      +48     
+ Misses      10965    10959       -6     
+ Partials     1447     1445       -2     


@kaovilai (Contributor) commented

Potential for scope overlap with #8063

@mpryc (Author) commented Aug 21, 2024

@mpryc I suggest adding a changelog file for this PR.

Added. I don't know how to remove the kind/changelog-not-required label once it has been added; possibly the owner of the repo can do that.

@mpryc (Author) commented Aug 21, 2024

Potential for scope overlap with #8063

I am not convinced about that. #8063 is about retrying on status updates; this fix is about a race condition on a particular object during the restore operation when applying managed fields. A general design rework of the restore operation could address both, but that's a bigger chunk of work.

In short, this fix addresses the situation where, after the object is first created in the cluster at:

createdObj, restoreErr = resourceClient.Create(obj)

and before the patch for managed fields is calculated at:

withoutManagedFields := createdObj.DeepCopy()

there are a number of operations on the in-cluster object, including status updates, and it can happen that the object being patched no longer represents the current cluster version. This is of course done to save API calls to the cluster, and we should only retry such an operation when there is a real error.

I believe this is not really what you are looking into within #8063, as we are explicitly retrying on object conflicts and not on other problems such as an unreachable cluster API:

https://pkg.go.dev/k8s.io/client-go/util/retry#RetryOnConflict
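
For reference, RetryOnConflict is a thin wrapper around retry.OnError with a conflict predicate, so only 409s are retried. A rough sketch of that scoping, reusing the resourceClient, name, and patchBytes names that appear in the restore code discussed in this thread:

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
)

// Only a conflict (HTTP 409) triggers another attempt; any other failure, such
// as an unreachable API server, is returned to the caller immediately.
err := retry.OnError(retry.DefaultRetry, apierrors.IsConflict, func() error {
	_, patchErr := resourceClient.Patch(name, patchBytes)
	return patchErr
})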

@sseago (Collaborator) commented Aug 21, 2024

@kaovilai I agree with @mpryc. This seems to be a completely different retry that's needed: not Velero CRs but restored resources, and not related to API server outages but to conflict errors, so both the error being matched and the backoff/duration profile are completely different.

@sseago removed the kind/changelog-not-required label (PR does not require a user changelog; often for docs, website, or build changes) on Aug 21, 2024
@sseago (Collaborator) commented Aug 21, 2024

@mpryc I removed the changelog-not-required label

@sseago closed this Aug 21, 2024
@sseago reopened this Aug 21, 2024
@@ -0,0 +1 @@
Random race condition in the restore with managed fields
Collaborator review comment:

@mpryc changelog needs to have PR number, not issue number -- s/8132/8133/ in the filename.

Also, I'd rephrase the changelog to describe the fix rather than the bug (since it will appear in release notes):
"Fixed race condition for conflicts on patching managed fields" or something like that.

@mpryc (Author) replied:

Changed.

This commit addresses issue vmware-tanzu#8132, where an error randomly
appears in the logs during the restore operation.

The error occurs due to a race condition when attempting
to patch managed fields on an object that has been modified
in the cluster. The error message indicates that the operation
cannot be fulfilled because the object has been modified,
suggesting that changes should be applied to the latest version.

To resolve this, a retry mechanism has been implemented in the restore
process when encountering this error, ensuring that managed fields
are properly restored without the error message appearing in the logs.

Signed-off-by: Michal Pryc <mpryc@redhat.com>
@ywk253100 (Contributor) commented

@mpryc The patch API should not report the error mentioned in issue #8132. The reason #8132 happens is that the resourceVersion field is included when doing the patch (the resourceVersion changes in the time window between when the object is created and when it is patched).

So a more reasonable solution would be to make sure the resourceVersion is not included in the patch data, e.g. make the resourceVersion the same on both objects when computing the patch data.

@ywk253100 (Contributor) left a review:

See my comments

@mpryc (Author) commented Aug 28, 2024

So a more reasonable solution would be to make sure the resourceVersion is not included in the patch data, e.g. make the resourceVersion the same on both objects when computing the patch data.

@ywk253100 Won't this possibly cause other issues? There is a risk of applying a patch based on an out-of-sync version of the object. This can lead to unintended modifications, conflicts, or even overwriting data that has been updated since the original resourceVersion was retrieved. Kubernetes relies on an optimistic concurrency control model, which we may be breaking by assuming we always have the correct version in our cache.

@ywk253100 (Contributor) commented

@mpryc I checked the code again; it seems the resourceVersion isn't included in the patch data. The only difference between withoutManagedFields and createdObj is the managed fields.

withoutManagedFields := createdObj.DeepCopy()
createdObj.SetManagedFields(obj.GetManagedFields())
patchBytes, err := generatePatch(withoutManagedFields, createdObj)

I'm curious why the patch operation reported the conflict error; only a patch with resourceVersion or an update could report such an error.
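
To make the question concrete, here is a small hypothetical illustration (not necessarily how Velero's generatePatch is implemented) showing that a merge patch computed from two objects that differ only in managedFields contains only that key, so it carries no resourceVersion precondition:

package main

import (
	"fmt"

	jsonpatch "github.com/evanphx/json-patch"
)

func main() {
	// Two copies of the same object; only metadata.managedFields differs.
	original := []byte(`{"metadata":{"name":"example","resourceVersion":"12345"}}`)
	modified := []byte(`{"metadata":{"name":"example","resourceVersion":"12345","managedFields":[{"manager":"velero"}]}}`)

	// The merge patch contains only the changed key.
	patch, err := jsonpatch.CreateMergePatch(original, modified)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch)) // {"metadata":{"managedFields":[{"manager":"velero"}]}}
}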

@mpryc (Author) commented Aug 30, 2024

@mpryc I checked the code again; it seems the resourceVersion isn't included in the patch data. The only difference between withoutManagedFields and createdObj is the managed fields.

withoutManagedFields := createdObj.DeepCopy()
createdObj.SetManagedFields(obj.GetManagedFields())
patchBytes, err := generatePatch(withoutManagedFields, createdObj)

I'm curious why the patch operation reported the conflict error; only a patch with resourceVersion or an update could report such an error.

@ywk253100 Any attempt to patch an object in the cluster, including fields other than resourceVersion such as status, can result in a conflict if the patch is generated from an outdated in-memory version of the object.

When the patch is applied, it may conflict with the current state of the object, leading to a conflict error. To address this, it's important to implement a retry mechanism. client-go/util/retry is defined specifically for such scenarios and allows the operation to be retried using the latest version of the object when a conflict occurs.

I believe the correct fix is actually to perform a retry on conflict, which will ensure only this scenario is taken into consideration:

https://github.com/kubernetes/client-go/blob/02a19c375c491042890be396d94a26a69da89563/util/retry/util.go#L68-L75

@ywk253100 (Contributor) commented

@mpryc I checked the code again; it seems the resourceVersion isn't included in the patch data. The only difference between withoutManagedFields and createdObj is the managed fields.

withoutManagedFields := createdObj.DeepCopy()
createdObj.SetManagedFields(obj.GetManagedFields())
patchBytes, err := generatePatch(withoutManagedFields, createdObj)

I'm curious why the patch operation reported the conflict error; only a patch with resourceVersion or an update could report such an error.

@ywk253100 Any attempt to patch an object in the cluster, including fields other than resourceVersion such as status, can result in a conflict if the patch is generated from an outdated in-memory version of the object.

When the patch is applied, it may conflict with the current state of the object, leading to a conflict error. To address this, it's important to implement a retry mechanism. client-go/util/retry is defined specifically for such scenarios and allows the operation to be retried using the latest version of the object when a conflict occurs.

I believe the correct fix is actually to perform a retry on conflict, which will ensure only this scenario is taken into consideration:

https://github.com/kubernetes/client-go/blob/02a19c375c491042890be396d94a26a69da89563/util/retry/util.go#L68-L75

Hi @mpryc If you look at the code I pasted, the only difference between the two objects is the ManagedFields, so the patch data should only contain the changed ManagedFields part, right? Do you mean this kind of patch can also cause a conflict?
Per my understanding, it cannot. Correct me if I'm wrong.
And the Kubernetes API docs say:

Patches will never cause optimistic locking failures, and the last write will win. Patches are recommended when the full state is not read before an update, or when failing on optimistic locking is undesirable. When patching complex types, arrays and maps, how the patch is applied is defined on a per-field basis and may either replace the field's current value, or merge the contents into the current value.

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/

@mpryc (Author) commented Sep 2, 2024

@mpryc I checked the code again; it seems the resourceVersion isn't included in the patch data. The only difference between withoutManagedFields and createdObj is the managed fields.

withoutManagedFields := createdObj.DeepCopy()
createdObj.SetManagedFields(obj.GetManagedFields())
patchBytes, err := generatePatch(withoutManagedFields, createdObj)

Hi @mpryc If you look at the code I pasted, the only difference between the two objects is the ManagedFields, so the patch data should only contain the changed ManagedFields part, right?

@ywk253100 I agree with you. While the chances of a conflict error due to the generated patch are minimal, they are still possible, as evidenced by the errors we see in our logs.

I can't think of any other scenarios that could be causing this specific error. Implementing retries on this patch seems to be a reasonable approach to address the issue.

The error "the object has been modified; please apply your changes to the latest version and try again" in the logs is clearly linked to this line and most likely results from a race condition. Although I can't think of any other possible causes, the problem persists. If you have suggestions for a better fix, I'm open to testing them:

if _, err = resourceClient.Patch(name, patchBytes); err != nil {

@ywk253100 (Contributor) commented Sep 3, 2024

@mpryc
I checked the Kubernetes API server code, and it seems only PATCH calls with resourceVersion and PUT API calls can cause the conflict error. If that's right, a PATCH without resourceVersion should not cause the error.

If the PATCH operation can cause the conflict error, we would need to add the retry logic to every PATCH call.
So let's spend a bit more time confirming the root cause before merging the PR, WDYT?

BTW, in which versions of Velero and Kubernetes did you see the issue? Is it reproducible?
I saw the failed resource is a namespace; is this issue specific to namespaces or not?

@mpryc (Author) commented Sep 3, 2024

@ywk253100
That was visible on Velero 1.14. It's hard for me to check what the generated patch looked like and whether it had the resourceVersion, as this is a random error visible in our logs after a CI run, so more data is lost; but if needed I could build a custom Velero and collect more data. Would the original object, the patched object, and the generated patch be sufficient for further debugging?
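
A rough sketch of the kind of instrumentation such a custom build could add around the patch call; the logger and field names here are hypothetical, while the Patch call shape and variable names come from the code discussed above:

import (
	"fmt"

	"github.com/sirupsen/logrus"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// On a conflict, dump the inputs and the generated patch so the failing case
// can be inspected after a CI run.
if _, err := resourceClient.Patch(name, patchBytes); err != nil {
	if apierrors.IsConflict(err) {
		logrus.WithFields(logrus.Fields{
			"originalObject": fmt.Sprintf("%+v", withoutManagedFields.Object),
			"patchedObject":  fmt.Sprintf("%+v", createdObj.Object),
			"patch":          string(patchBytes),
		}).Warn("conflict while patching managed fields")
	}
	// existing error handling continues here
}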

@ywk253100 (Contributor) commented

@ywk253100 That was visible on Velero 1.14. It's hard for me to check what the generated patch looked like and whether it had the resourceVersion, as this is a random error visible in our logs after a CI run, so more data is lost; but if needed I could build a custom Velero and collect more data. Would the original object, the patched object, and the generated patch be sufficient for further debugging?

@mpryc That's OK as a first step of further debugging.

BTW, it seems the managedFields patching logic will not apply to the namespace resource in the normal restore process; it is skipped here: https://github.com/vmware-tanzu/velero/blob/v1.14.1/pkg/restore/restore.go#L739. The only way the patching logic can apply to a namespace is when processing the additionalItems returned by a RestoreItemAction: https://github.com/vmware-tanzu/velero/blob/v1.14.1/pkg/restore/restore.go#L1393

Is there any RestoreItemAction in your deployment that returns the namespace in additionalItems?

Please also check in your further debugging whether the conflict error is reported only for the namespace resource.

Successfully merging this pull request may close these issues.

Random race condition in the restore with managed fields
6 participants