
Adjust the logic for the backup_last_status metric to stop incorrectly incrementing over time #7445

Merged · 1 commit into vmware-tanzu:main on Feb 21, 2024

Conversation

allenxu404 (Contributor) commented on Feb 20, 2024

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #7378

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@allenxu404 allenxu404 changed the title Adjust the logic for the backup_last_status metric to stop incorrectl… Adjust the logic for the backup_last_status metric to stop incorrectly incrementing over time Feb 20, 2024
codecov bot commented on Feb 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (d45b313) 61.64% compared to head (84fb88c) 61.64%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7445   +/-   ##
=======================================
  Coverage   61.64%   61.64%           
=======================================
  Files         262      262           
  Lines       28477    28477           
=======================================
  Hits        17556    17556           
  Misses       9689     9689           
  Partials     1232     1232           

☔ View full report in Codecov by Sentry.

Commit: Adjust the logic for the backup_last_status metric to stop incorrectly incrementing over time

Signed-off-by: allenxu404 <qix2@vmware.com>
@blackpiglet blackpiglet merged commit 2a1ae0e into vmware-tanzu:main Feb 21, 2024
66 checks passed
@@ -468,7 +468,7 @@ func (m *ServerMetrics) InitSchedule(scheduleName string) {
 		c.WithLabelValues(scheduleName).Add(0)
 	}
 	if c, ok := m.metrics[backupLastStatus].(*prometheus.GaugeVec); ok {
-		c.WithLabelValues(scheduleName).Add(1)
+		c.WithLabelValues(scheduleName).Set(float64(1))
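For context outside the diff, here is a minimal, self-contained sketch (not Velero code; the metric and label names are placeholders) of the behavior this change relies on with the standard Prometheus Go client: Add on a gauge accumulates across calls, while Set overwrites the previous value.

```go
// Standalone sketch, not Velero code: compare Add vs Set on a GaugeVec.
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	lastStatus := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "backup_last_status", Help: "placeholder"},
		[]string{"schedule"},
	)

	// Simulate the schedule being reconciled three times with the old logic.
	for i := 0; i < 3; i++ {
		lastStatus.WithLabelValues("daily").Add(1)
	}
	fmt.Println(testutil.ToFloat64(lastStatus.WithLabelValues("daily"))) // 3: keeps growing

	// The same three reconciles with the new logic.
	for i := 0; i < 3; i++ {
		lastStatus.WithLabelValues("daily").Set(float64(1))
	}
	fmt.Println(testutil.ToFloat64(lastStatus.WithLabelValues("daily"))) // 1: stays put
}
```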
Snaipe commented:
Found this PR when trying to figure out why the metric was increasing; if I understand correctly, this ends up running periodically (every time the schedule is reconciled), right?

If so, wouldn't this incorrectly clear a failing status? Say the schedule created a backup that failed: the metric would become 0 and then get reset back to 1 after a couple of minutes, which seems problematic since it can hide failures.

If anything, I think lines 470-471 should be omitted entirely. That would avoid emitting a data point until we actually have one, which should still fix what #6838 was aiming for. If an accurate status must be set, it should be done as part of scheduleReconciler.Reconcile, using the status of the last backup created by the schedule.
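For illustration only, a rough sketch of that last suggestion; the helper below and its parameters are hypothetical, not Velero's actual API, and assume the reconciler can look up the phase of the most recent backup created by the schedule.

```go
// Hypothetical sketch, not Velero code: derive backup_last_status from the
// most recent backup of the schedule, and emit nothing if no backup exists yet
// instead of publishing a default value.
package metricsketch

import "github.com/prometheus/client_golang/prometheus"

func recordLastStatus(gauge *prometheus.GaugeVec, scheduleName, lastBackupPhase string, hasBackup bool) {
	if !hasBackup {
		// No backup has run yet: leave the series absent rather than guessing.
		return
	}
	value := float64(0)
	if lastBackupPhase == "Completed" {
		value = 1
	}
	gauge.WithLabelValues(scheduleName).Set(value)
}
```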

allenxu404 (Contributor, Author) replied:

@Snaipe yes, you're right.

This PR only aimed to address the specific bug caused by #6838. I will create a separate PR to resolve the issue of the failure status being incorrectly cleared and keep you posted. Thanks for highlighting this larger problem!

The purpose of setting the default value of backup_last_status to 1 was to prevent mistaken failure reports that could occur if the code exits without updating that metric, so I prefer to leave that implementation unchanged. On further analysis, I think the true underlying issue is that the metrics are re-initialized every time the schedule is reconciled; instead, they should be initialized only when a new backup is due to be created. So I think the InitSchedule function would be better placed below, where backups are created, to avoid unnecessary resets. What are your thoughts?

	if c.ifDue(schedule, cronSchedule) && !c.checkIfBackupInNewOrProgress(schedule) {
		if err := c.submitBackup(ctx, schedule); err != nil {
			return ctrl.Result{}, errors.Wrapf(err, "error submit backup for schedule %s", req.String())
		}
		c.metrics.InitSchedule(schedule.Name)
	}

Snaipe replied:
I think it'd help, but I'm not sure it would fix the underlying problem. As far as I can tell, the effect would be that if your schedule is failing and a new backup is created, the schedule would be reported as "fixed" metrics-wise for the entire duration of that backup.

It's less bad than the first situation, but it would definitely confuse alerting.

I think not advertising a value for the metric until there actually is a data point seems slightly better, in the sense that users can choose to ignore it or substitute the default value of their choice (0 or 1) in Grafana/Alertmanager. A better solution would be to set the initial value of the metric directly to the result of the last backup of the schedule.
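A small sketch of the "no value until there is a data point" idea, assuming the standard Prometheus Go client; the metric and label names are placeholders. DeleteLabelValues is the client call that drops a labeled series so it simply stops appearing in scrapes.

```go
// Sketch: never pre-initialize backup_last_status; publish it only once a
// real result exists, and drop the series if the schedule goes away.
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	lastStatus := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "backup_last_status", Help: "placeholder"},
		[]string{"schedule"},
	)
	prometheus.MustRegister(lastStatus)

	// Until Set is called with a real backup result, the series is absent and
	// dashboards/alerts can treat "no data" however they prefer.

	// ...after the first backup for this schedule finishes:
	lastStatus.WithLabelValues("daily").Set(1) // 1 = success, 0 = failure

	// If the schedule is deleted, remove its series entirely.
	lastStatus.DeleteLabelValues("daily")
}
```

In PromQL, an expression such as absent(backup_last_status{schedule="daily"}) could then be used to flag schedules that have never reported a result.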

allenxu404 (Contributor, Author) replied:

> A better solution would be to set the initial value of the metric directly to the result of the last backup of the schedule.

I agree, will raise another issue to track it.

allenxu404 (Contributor, Author) added:

It seems we already have a similar issue tracking this: #6936

Successfully merging this pull request may close these issues.

v1.13.0 changes velero_backup_last_status metric
4 participants