
Mark InProgress backup/restore as failed upon requeuing #7863

Closed
wants to merge 1 commit

Conversation

@kaovilai (Contributor) commented Jun 5, 2024:

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #7207

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@kaovilai force-pushed the requeue&fail branch 3 times, most recently from 45111eb to 91b2a54, on June 5, 2024 at 20:57
@kaovilai requested a review from sseago on June 5, 2024 at 21:04

codecov bot commented Jun 6, 2024

Codecov Report

Attention: Patch coverage is 56.66667% with 13 lines in your changes missing coverage. Please review.

Project coverage is 58.78%. Comparing base (a8d77ea) to head (5e52668).

File                                    Patch %    Lines
pkg/controller/restore_controller.go    52.38%     9 Missing and 1 partial ⚠️
pkg/controller/backup_controller.go     66.66%     2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7863      +/-   ##
==========================================
- Coverage   58.80%   58.78%   -0.02%     
==========================================
  Files         345      345              
  Lines       28759    28777      +18     
==========================================
+ Hits        16911    16917       +6     
- Misses      10420    10431      +11     
- Partials     1428     1429       +1     


@kaovilai force-pushed the requeue&fail branch 2 times, most recently from 97b85e8 to c6aa114, on June 6, 2024 at 21:12
Commit message: remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai (Contributor, author) left a comment:

Just need to add unit tests and will mark ready for review.

Comment on lines +292 to +293
// return the error so the status can be re-processed; it's currently still not completed or failed
return ctrl.Result{}, err
@kaovilai (author):

This is where we patch completion status on restore. We can store completion in memory so that subsequent requeues patch status to complete rather than to fail.
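A minimal sketch of what that in-memory record could look like; the type and method names here are hypothetical (not code from this PR), and the mutex anticipates the thread-safety concern raised in the next comment:

```go
package controller

import "sync"

// completionTracker is a hypothetical in-memory record of backups/restores
// whose processing finished but whose final status patch failed, so a later
// requeue can patch them to Completed instead of marking them Failed.
type completionTracker struct {
	mu        sync.Mutex
	completed map[string]struct{} // key: "namespace/name"
}

func newCompletionTracker() *completionTracker {
	return &completionTracker{completed: make(map[string]struct{})}
}

// MarkCompleted records that a CR finished its work, even if the final
// status patch has not yet succeeded.
func (t *completionTracker) MarkCompleted(ns, name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.completed[ns+"/"+name] = struct{}{}
}

// IsCompleted reports whether a requeued CR should be patched to Completed.
func (t *completionTracker) IsCompleted(ns, name string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.completed[ns+"/"+name]
	return ok
}
```

Being in-memory, such a record would of course be lost on a Velero pod restart, which is the same failure mode the thread goes on to discuss.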

Collaborator:

In memory or on disk? If in memory, we just need to be careful about thread safety (to handle future cases where we have multiple reconcilers).

Also, we have the regular controller and the finalizer controller, so there are two different places/patches to deal with, i.e., which CR and which reconciler/controller.

@kaovilai (author):

The finalizer controllers won't mark backup/restore as completed, right? This in-memory record is limited to the Completed phase.

Collaborator:

Actually, the finalizer controller is the only one that marks it completed; the backup/restore controller moves it from InProgress to Finalizing. Both of these transitions need the requeue-to-fail-or-update handling, though.

@kaovilai (author):

Ack, thanks for the catch.

@@ -307,7 +320,10 @@ func (b *backupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
log.Info("Updating backup's final status")
if err := kubeutil.PatchResource(original, request.Backup, b.kbClient); err != nil {
log.WithError(err).Error("error updating backup's final status")
// return the error so the status can be re-processed; it's currently still not completed or failed
Contributor:

I remember that the CR won't be re-enqueued when an error is returned from the reconciler; could you confirm this?

@kaovilai (author):

We confirmed that returning an error causes the reconciler to requeue.

This is equivalent to returning with requeue: true but with more information.

See: https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.18.4/pkg/reconcile#Reconciler

Collaborator:

	// If the returned error is non-nil, the Result is ignored and the request will be
	// requeued using exponential backoff. The only exception is if the error is a
	// TerminalError in which case no requeuing happens.
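A self-contained illustration of those return shapes in a controller-runtime reconciler (stub logic, not Velero code; `doWork` is a placeholder):

```go
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

type exampleReconciler struct{}

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := doWork(ctx); err != nil {
		// Non-nil error: the Result is ignored and the request is
		// requeued with exponential backoff (unless the error is a
		// reconcile.TerminalError).
		return ctrl.Result{}, err
	}
	// Alternatively, a nil error with Requeue set also re-enqueues,
	// but without carrying the failure into logs and error metrics:
	//   return ctrl.Result{Requeue: true}, nil

	// Nil error and zero Result: the request is not requeued.
	return ctrl.Result{}, nil
}

// doWork is a placeholder for the reconciler's real work.
func doWork(ctx context.Context) error { return nil }
```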

@@ -230,6 +230,20 @@ func (b *backupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
switch original.Status.Phase {
case "", velerov1api.BackupPhaseNew:
// only process new backups
case velerov1api.BackupPhaseInProgress:
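The body of the new InProgress case is truncated in the hunk above. A reconstructed sketch of the apparent intent, going by the PR title; the helper and the failure reason text are hypothetical, not the PR's actual code:

```go
package controller

import (
	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// failStuckBackup sketches the idea behind the new case: a Backup that is
// reconciled while already InProgress implies the earlier attempt never
// reached a terminal phase (e.g., the final status patch failed), so it is
// moved straight to Failed instead of being left stuck.
func failStuckBackup(backup *velerov1api.Backup, reason string) {
	backup.Status.Phase = velerov1api.BackupPhaseFailed
	backup.Status.FailureReason = reason
}
```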
Contributor:

I suggest we don't add this change. It assumes the InProgress CR won't be re-enqueued, but this constraint may hinder future development:

  1. If we want to add a periodic enqueue mechanism, this assumption will break.
  2. If we want to support a feature like Cancel, this assumption will break.
  3. When we support parallel backups, this assumption will break.
  4. This mechanism can only work with backup/restore CRs in the Velero server pod; for the CRs in the node-agent, the assumption does not hold, since the node-agent is a DaemonSet.

Generally, this change is too delicate, and unless it is necessary we should avoid adding it. Considering the problem we are trying to solve, I think the retry mechanism is enough as a short-term fix; for the long term, we should develop a mechanism that works with a backup window. The current change won't be the ultimate fix.

Collaborator:

"When we support parallel backups, this assumption will break"
Actually, this shouldn't break it. With parallel backups, InProgress backups already being handled by controller threads won't be reconciled again.

Collaborator:

The retry is a good option for temporary APIServer outages (which is what originally motivated this). However, we still have a somewhat significant bug here if the APIServer is gone for an extended period of time: we end up with backups and restores (and yes, also DU/DD CRs) in an invalid state, listed as "InProgress" but never to be touched again as long as Velero stays up. The only workaround is to force-restart the Velero pod, which can cause ongoing operations to fail.

@Lyndon-Li (Contributor) commented Jun 18, 2024:

Yes, I agree we still have limitations if we only support the retry mechanism. But as listed above, the current change is not an ultimate or universal solution for all controllers.
If this change did everything well, we could include it. However, it adds a new path to the reconciler's state machine, which is not a good design pattern and will hinder future development. Specifically, whenever we add a new feature or change, we have to account for this new path; e.g., it may break when we support cancel, parallel backups, etc.

Contributor:

I don't think we need to go into the details of cancel or parallel backup, because we don't have them for now. But generally speaking, with this change we must keep one more rule in mind: an InProgress CR cannot be reconciled again, otherwise it will be marked as failed. This will limit our choices when we develop new features.

@kaovilai (author):

With the latest changes in https://github.com/openshift/velero/pull/324/files, I believe we have addressed the ability to cancel by using the backupTracker/restoreTracker.

@kaovilai (author):

Closing, as we prefer the simpler retry approach, which will be less complicated for future reconciler work such as requeuing.

@kaovilai closed this Jul 17, 2024
Successfully merging this pull request may close these issues.

Velero didn't retry on failed Restore CR status update, causing the CR to remain stuck in "InProgress"