Add additional label to cortex_discarded_samples_total with the label value defined by a config flag #3439
Conversation
Sorry for not being constructive in this feedback, but I had to read the code to understand what "split metrics by a further label" means. I'm not sure that "Split" is the best word, although I can't come up with a better name. "Label" as a verb would be a proper term, but it also wouldn't help in understanding the feature IMO.
Do we need a documentation issue for this?
Let’s get closer to clarity together. Blocking for now so we can throw some more ideas back and forth.
doc is fine
Please list this feature as experimental in the list at docs/sources/operators-guide/configure/about-versioning.md, and also add a CHANGELOG entry for this [FEATURE] mentioning it's experimental.
The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase.
LGTM! Great job! Thank you for addressing all my feedback.
I think this is still missing.
Good job! Overall LGTM. I left many minor comments. I think I found an issue around locking in ActiveGroupsCleanupService.iteration() (see comment there). In a few cases I left comments about stuff that I would like to see in a follow-up PR.
pkg/mimir/mimir.go (outdated)
@@ -98,6 +98,7 @@ type Config struct {
	MultitenancyEnabled bool `yaml:"multitenancy_enabled"`
	NoAuthTenant string `yaml:"no_auth_tenant" category:"advanced"`
	ShutdownDelay time.Duration `yaml:"shutdown_delay" category:"experimental"`
	MaxGroupsPerUser int `yaml:"max_groups_per_user" category:"experimental"`
[Non-blocking because it's marked as experimental and we can move it around] It's a bit of an anti-pattern to define it here. I'm thinking about more actionable feedback to give you.
if s.activeGroups.ActiveGroupLimitExceeded(user, group) {
	group = "other"
}
s.activeGroups.UpdateGroupTimestampForUser(user, group, now)
[Not this PR, but for a follow-up PR] We need to take the lock and look up the map twice. It would be reduced to one if we moved the limit logic into UpdateGroupTimestampForUser() (e.g. passing the limit as an input). If the limit is reached, it could return a specific error which we then check here.
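For illustration, a minimal self-contained sketch of that idea, with simplified types and a hypothetical signature (not Mimir's actual ActiveGroups implementation): the limit check and the timestamp update happen under a single lock, and a sentinel error tells the caller to fall back to "other".

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errGroupLimitExceeded is a hypothetical sentinel error; the real API may differ.
var errGroupLimitExceeded = errors.New("active group limit exceeded")

// activeGroups is a simplified stand-in for Mimir's ActiveGroups type.
type activeGroups struct {
	mu                sync.RWMutex
	timestampsPerUser map[string]map[string]int64 // user -> group -> last update (unix nanos)
}

// updateGroupTimestampForUser refreshes the group's timestamp and enforces the
// per-user group limit under the same lock, so the map is only looked up once.
func (ag *activeGroups) updateGroupTimestampForUser(user, group string, now time.Time, limit int) error {
	ag.mu.Lock()
	defer ag.mu.Unlock()

	groups := ag.timestampsPerUser[user]
	if _, known := groups[group]; !known && limit > 0 && len(groups) >= limit {
		// A new group would exceed the limit; the caller can retry with "other".
		return errGroupLimitExceeded
	}
	if groups == nil {
		groups = map[string]int64{}
		ag.timestampsPerUser[user] = groups
	}
	groups[group] = now.UnixNano()
	return nil
}

func main() {
	ag := &activeGroups{timestampsPerUser: map[string]map[string]int64{}}

	group := "payments"
	if err := ag.updateGroupTimestampForUser("tenant-1", group, time.Now(), 1); errors.Is(err, errGroupLimitExceeded) {
		group = "other"
		_ = ag.updateGroupTimestampForUser("tenant-1", group, time.Now(), 0) // 0 means no limit here
	}
	fmt.Println("recorded group:", group)
}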
pkg/util/active_groups.go (outdated)
@@ -91,11 +84,31 @@ func (ag *ActiveGroups) PurgeInactiveGroupsForUser(userID string, deadline int64
	return deletedGroups
}

func (ag *ActiveGroups) PurgeInactiveGroups(inactiveTimeout time.Duration, cleanupFuncs ...func(string, string)) {
	userIDs := make([]string, len(ag.timestampsPerUser))
You can't access ag.timestampsPerUser outside the lock. Please move it after ag.mu.RLock().
Also, to simplify the code, initialise the capacity, not the length, and then just use userIDs = append(userIDs, userID). This removes the need to keep i.
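A short sketch of the suggested pattern, again with simplified types rather than the real Mimir code: the map is only read while the read lock is held, the slice is allocated with a capacity (not a length) of len(...), and append removes the need for an index variable.

package main

import (
	"fmt"
	"sync"
)

// activeGroups is a simplified stand-in for the real type.
type activeGroups struct {
	mu                sync.RWMutex
	timestampsPerUser map[string]map[string]int64
}

// userIDs snapshots the user IDs while holding the read lock.
func (ag *activeGroups) userIDs() []string {
	ag.mu.RLock()
	defer ag.mu.RUnlock()

	// Capacity, not length: append fills the slice without tracking an index.
	userIDs := make([]string, 0, len(ag.timestampsPerUser))
	for userID := range ag.timestampsPerUser {
		userIDs = append(userIDs, userID)
	}
	return userIDs
}

func main() {
	ag := &activeGroups{timestampsPerUser: map[string]map[string]int64{
		"tenant-1": {},
		"tenant-2": {},
	}}
	fmt.Println(len(ag.userIDs()), "users")
}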
pkg/util/active_groups.go (outdated)
	ag.mu.RUnlock()

	for _, userID := range userIDs {
		inactiveGroups := ag.PurgeInactiveGroupsForUser(userID, time.Now().Add(-inactiveTimeout).UnixNano())
Please compute time.Now().Add(-inactiveTimeout).UnixNano() just once, before the for loop, and then reuse it.
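Sketched as a fragment of how that part of the loop could look after the change (illustrative only, using the variable names from the diff above): the deadline is computed once and reused on every iteration.

// Compute the purge deadline once, outside the loop.
deadline := time.Now().Add(-inactiveTimeout).UnixNano()

for _, userID := range userIDs {
	inactiveGroups := ag.PurgeInactiveGroupsForUser(userID, deadline)
	_ = inactiveGroups // ... cleanup of the returned groups stays as in the original ...
}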
Thanks for patiently addressing my feedback! Leaving here a list of unaddressed comments we agreed to work on in a follow-up PR:
- #3439 (comment)
- #3439 (comment)
- #3439 (comment)
- #3439 (comment)
- #3439 (comment)
What this PR does
Gives the ability to specify a config flag, validation.separate-metrics-label, which adds an additional label, group, to cortex_discarded_samples_total, taking its value from a label on an incoming timeseries. This works similarly to HATracker and the cluster label. The default value of this config flag is "", which means that when the flag is not set, the label should get dropped when scraped by Prometheus.

There is a limit to the number of active groups, set by the -max-groups-per-user flag. If a new group is received while the group limit has been reached, the value will be changed to "other". Inactive groups are cleaned up every minute, and a group is considered inactive if it has not been updated in 20 minutes.

The counters that this affects are as follows:
Distributor
Ingester Metrics
Validate
For the above counters, the deletion of these metrics has been updated to use discardedCounter.DeletePartialMatch(userID).
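For illustration, a minimal standalone sketch (not the PR's actual code) of how DeletePartialMatch on a prometheus CounterVec removes every series matching the user label, regardless of what the new group label holds:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical counter mirroring cortex_discarded_samples_total with the new "group" label.
	discardedCounter := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cortex_discarded_samples_total",
		Help: "The total number of samples that were discarded.",
	}, []string{"reason", "user", "group"})

	discardedCounter.WithLabelValues("rate_limited", "tenant-1", "payments").Inc()
	discardedCounter.WithLabelValues("rate_limited", "tenant-1", "checkout").Inc()

	// A partial match on "user" deletes both series above, whatever their "group" value.
	deleted := discardedCounter.DeletePartialMatch(prometheus.Labels{"user": "tenant-1"})
	fmt.Println("series deleted:", deleted) // prints: series deleted: 2
}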
This builds upon the work started by @LeviHarrison (https://github.com/LeviHarrison) in: #2702
Which issue(s) this PR fixes or relates to
Fixes #2420
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]