Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust the logic for the backup_last_status metric to stop incorrectly incrementing over time #7445

Merged
merged 1 commit into from
Feb 21, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelogs/unreleased/7445-allenxu404
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Adjust the logic for the backup_last_status metrics to stop incorrectly incrementing over time
2 changes: 1 addition & 1 deletion pkg/metrics/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -468,7 +468,7 @@ func (m *ServerMetrics) InitSchedule(scheduleName string) {
c.WithLabelValues(scheduleName).Add(0)
}
if c, ok := m.metrics[backupLastStatus].(*prometheus.GaugeVec); ok {
c.WithLabelValues(scheduleName).Add(1)
c.WithLabelValues(scheduleName).Set(float64(1))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found this PR when trying to figure out why the metric was increasing; if I understand correctly, this ends up running periodically (every time the schedule is reconciled), right?

If so, wouldn't this incorrectly clear a failing status? Say that the schedule created a backup that failed, the metric would become 0, and then get reset back to 1 after a couple of minutes, which seems problematic since this can hide failures.

If anything, I think lines 470-471 should be omitted entirely. This would prevent from emitting a data point until the point where we actually have one, which should still fix what #6838 was aiming for. If an accurate status must be set, it should be done as part of scheduleReconciler.Reconcile, using the status of the last backup created by the schedule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Snaipe yes, you're right.

This PR only aimed to address the specific bug caused in #6838. I will create a separate PR to resolve the issue with incorrect failure status clear and keep you posted. Thanks for highlighting this larger problem!

The purpose of setting the default value of backup_last_status to 1 was to prevent mistaken failure reports that could occur if the code exits without updating that metric. I prefer to leave that implementation unchanged. Upon further analysis, I think the true underlying issue seems to be that metrics are re-initialized each time the schedule is reconciled, instead, the metrics should be initialized only if a new backup is due to be created. So I think the InitSchedulefunction would be better placed below, where backups are created, to avoid unnecessary reset. what's your thought?

       if c.ifDue(schedule, cronSchedule) && !c.checkIfBackupInNewOrProgress(schedule) {
		if err := c.submitBackup(ctx, schedule); err != nil {
			return ctrl.Result{}, errors.Wrapf(err, "error submit backup for schedule %s", req.String())
		}
		c.metrics.InitSchedule(schedule.Name)
	}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd help, but I'm not sure it would fix the underlying problem. As far as I can tell, what this would do is that if your schedule is failing, and a new backup is created, then the schedule would be reported as "fixed" metrics-wise for the entire duration of the backup.

It's less bad than the first situation, but it would definitely confuse alerting.

I think not advertising a value for the metric up until the point where there is a data point seems slightly better in the sense that users can choose to ignore them or set it to the default value of their choice (0 or 1) in grafana/alertmanager. A better solution would be to set the initial value of the metric directly to the result of the last backup of the schedule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better solution would be to set the initial value of the metric directly to the result of the last backup of the schedule.

I agree, will raise another issue to track it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we've already has the similar issue to track: #6936

}
if c, ok := m.metrics[restoreAttemptTotal].(*prometheus.CounterVec); ok {
c.WithLabelValues(scheduleName).Add(0)
Expand Down
Loading