
Mark InProgress backup/restore as failed upon requeuing #7863

Closed
wants to merge 1 commit

Conversation

@kaovilai (Contributor) commented Jun 5, 2024:

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #7207

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@kaovilai force-pushed the requeue&fail branch 3 times, most recently from 45111eb to 91b2a54, on June 5, 2024 at 20:57
@kaovilai requested a review from sseago on June 5, 2024 at 21:04

codecov bot commented Jun 6, 2024

Codecov Report

Attention: Patch coverage is 56.66667% with 13 lines in your changes missing coverage. Please review.

Project coverage is 58.78%. Comparing base (a8d77ea) to head (5e52668).

File                                    Patch %    Lines
pkg/controller/restore_controller.go    52.38%     9 Missing and 1 partial ⚠️
pkg/controller/backup_controller.go     66.66%     2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7863      +/-   ##
==========================================
- Coverage   58.80%   58.78%   -0.02%     
==========================================
  Files         345      345              
  Lines       28759    28777      +18     
==========================================
+ Hits        16911    16917       +6     
- Misses      10420    10431      +11     
- Partials     1428     1429       +1     


@kaovilai force-pushed the requeue&fail branch 2 times, most recently from 97b85e8 to c6aa114, on June 6, 2024 at 21:12
Commit message: remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai (Contributor, author) left a comment:

Just need to add unit tests and will mark ready for review.

Comment on lines +292 to +293
// return the error so the status can be re-processed; it's currently still not completed or failed
return ctrl.Result{}, err
@kaovilai (author):

This is where we patch completion status on restore. We can store completion in memory so that subsequent requeues patch status to complete rather than to fail.
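A minimal sketch of what that in-memory record could look like; the type and method names here are hypothetical (not code from this PR), and the mutex anticipates the thread-safety concern raised in the next comment:

```go
package controller

import "sync"

// completionTracker is a hypothetical in-memory record of backups/restores
// whose processing finished but whose final status patch failed, so a later
// requeue can patch them to Completed instead of marking them Failed.
type completionTracker struct {
	mu        sync.Mutex
	completed map[string]struct{} // key: "namespace/name"
}

func newCompletionTracker() *completionTracker {
	return &completionTracker{completed: make(map[string]struct{})}
}

// MarkCompleted records that a CR finished its work, even if the final
// status patch has not yet succeeded.
func (t *completionTracker) MarkCompleted(ns, name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.completed[ns+"/"+name] = struct{}{}
}

// IsCompleted reports whether a requeued CR should be patched to Completed.
func (t *completionTracker) IsCompleted(ns, name string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.completed[ns+"/"+name]
	return ok
}
```

Being in-memory, such a record would of course be lost on a Velero pod restart, which is the same failure mode the thread goes on to discuss.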

Collaborator:

In memory or on disk? If in memory, we just need to be careful about thread safety (to handle future cases where we have multiple reconcilers).

Also, we have the regular controller and the finalizer controller, so there are two different places/patches to deal with, i.e., which CR and which reconciler/controller.

@kaovilai (author):

The finalizer controllers won't mark backup/restore as completed, right? This in-memory record is limited to the Completed phase.

Collaborator:

Actually, the finalizer controller is the only one that marks it completed; the backup/restore controller moves it from InProgress to Finalizing. Both of these transitions need the requeue-to-fail-or-update handling, though.

@kaovilai (author):

Ack, thanks for the catch.

@@ -307,7 +320,10 @@ func (b *backupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
log.Info("Updating backup's final status")
if err := kubeutil.PatchResource(original, request.Backup, b.kbClient); err != nil {
log.WithError(err).Error("error updating backup's final status")
// return the error so the status can be re-processed; it's currently still not completed or failed
Contributor:

I remember that the CR won't be re-enqueued when an error is returned from the reconciler; could you confirm this?

@kaovilai (author):

We confirmed that returning an error causes the reconciler to requeue.

This is equivalent to returning with requeue: true but with more information.

See: https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.18.4/pkg/reconcile#Reconciler

Collaborator:

	// If the returned error is non-nil, the Result is ignored and the request will be
	// requeued using exponential backoff. The only exception is if the error is a
	// TerminalError in which case no requeuing happens.
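A self-contained illustration of those return shapes in a controller-runtime reconciler (stub logic, not Velero code; `doWork` is a placeholder):

```go
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

type exampleReconciler struct{}

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := doWork(ctx); err != nil {
		// Non-nil error: the Result is ignored and the request is
		// requeued with exponential backoff (unless the error is a
		// reconcile.TerminalError).
		return ctrl.Result{}, err
	}
	// Alternatively, a nil error with Requeue set also re-enqueues,
	// but without carrying the failure into logs and error metrics:
	//   return ctrl.Result{Requeue: true}, nil

	// Nil error and zero Result: the request is not requeued.
	return ctrl.Result{}, nil
}

// doWork is a placeholder for the reconciler's real work.
func doWork(ctx context.Context) error { return nil }
```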

@@ -230,6 +230,20 @@ func (b *backupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
switch original.Status.Phase {
case "", velerov1api.BackupPhaseNew:
// only process new backups
case velerov1api.BackupPhaseInProgress:
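The body of the new InProgress case is truncated in the hunk above. A reconstructed sketch of the apparent intent, going by the PR title; the helper and the failure reason text are hypothetical, not the PR's actual code:

```go
package controller

import (
	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// failStuckBackup sketches the idea behind the new case: a Backup that is
// reconciled while already InProgress implies the earlier attempt never
// reached a terminal phase (e.g., the final status patch failed), so it is
// moved straight to Failed instead of being left stuck.
func failStuckBackup(backup *velerov1api.Backup, reason string) {
	backup.Status.Phase = velerov1api.BackupPhaseFailed
	backup.Status.FailureReason = reason
}
```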
Contributor:

I suggest we don't add this change. It assumes the InProgress CR won't be re-enqueued, but this constraint may hinder future development:

  1. If we want to add a periodic enqueue mechanism, this assumption will break.
  2. If we want to support a feature like Cancel, this assumption will break.
  3. When we support parallel backups, this assumption will break.
  4. This mechanism can only work with backup/restore CRs in the Velero server pod; for the CRs in the node-agent, the assumption does not hold, since the node-agent is a DaemonSet.

Generally, this change is too delicate, and unless it is necessary we should avoid adding it. Considering the problem we are trying to solve, I think the retry mechanism is enough as a short-term fix; for the long term, we should develop a mechanism that works with a backup window. The current change won't be the ultimate fix.

Collaborator:

"When we support parallel backups, this assumption will break"
Actually, this shouldn't break it. With parallel backups, InProgress backups already being handled by controller threads won't be reconciled again.

Collaborator:

The retry is a good option for temporary APIServer outages (which is what originally motivated this). However, we still have a somewhat significant bug here if the APIServer is gone for an extended period of time: we end up with backups and restores (and yes, also DU/DD CRs) in an invalid state, listed as "InProgress" but never to be touched again as long as Velero stays up. The only workaround is to force-restart the Velero pod, which can cause ongoing operations to fail.

@Lyndon-Li (Contributor) commented Jun 18, 2024:

Yes, I agree we still have limitations if we only support the retry mechanism. But as listed above, the current change is not an ultimate or universal solution for all controllers.
If this change did everything well, we could include it. However, it adds a new path to the reconciler's state machine, which is not a good design pattern and will hinder future development. Specifically, whenever we add a new feature or change, we have to account for this new path; e.g., it may break when we support cancel, parallel backups, etc.

Contributor:

I don't think we need to go into the details of cancel or parallel backup, because we don't have them for now. But generally speaking, with this change we must keep one more rule in mind: an InProgress CR cannot be reconciled again, otherwise it will be marked as failed. This will limit our choices when we develop new features.

@kaovilai (author):

With the latest changes in https://github.com/openshift/velero/pull/324/files, I believe we have addressed the ability to cancel by using the backupTracker/restoreTracker.

@kaovilai (author):

Closing, as we prefer the simpler retry approach, which will be less complicated for future reconciler work such as requeuing.

@kaovilai closed this Jul 17, 2024
Successfully merging this pull request may close these issues.

Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress"