Add additional label to cortex_discarded_samples_total with the label value defined by a config flag #3439
Conversation
Sorry for not being constructive in this feedback, but I had to read the code to understand what "split metrics by a further label" means. I'm not sure that "Split" is the best word, although I can't come up with a better name. "Label" as a verb would be a proper term, but it also wouldn't help in understanding the feature IMO.
Do we need a documentation issue for this?
Let’s get closer to clarity together. Blocking for now so we can throw some more ideas back and forth.
doc is fine
Please list this feature as experimental in the list at docs/sources/operators-guide/configure/about-versioning.md, and also add a CHANGELOG entry for this [FEATURE] mentioning it's experimental.
The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase.
LGTM! Great job! Thank you for addressing all my feedback.
I think this is still missing.
Good job! Overall LGTM. I left many minor comments. I think I found an issue around locking in ActiveGroupsCleanupService.iteration() (see comment there). In a few cases I left comments about stuff that I would like to see in a follow-up PR.
pkg/mimir/mimir.go (outdated)
@@ -98,6 +98,7 @@ type Config struct {
	MultitenancyEnabled bool `yaml:"multitenancy_enabled"`
	NoAuthTenant string `yaml:"no_auth_tenant" category:"advanced"`
	ShutdownDelay time.Duration `yaml:"shutdown_delay" category:"experimental"`
	MaxGroupsPerUser int `yaml:"max_groups_per_user" category:"experimental"`
[Non-blocking because it's marked as experimental and we can move it around] It's a bit of an anti-pattern to define it here. I'm thinking about more actionable feedback to give you.
if s.activeGroups.ActiveGroupLimitExceeded(user, group) {
	group = "other"
}
s.activeGroups.UpdateGroupTimestampForUser(user, group, now)
[Not this PR, but for a follow-up PR] We need to take the lock and look up the map twice. It would be reduced to one if we moved the limit logic into UpdateGroupTimestampForUser() (e.g. passing the limit as an input). If the limit is reached, it could return a specific error which we then check here.
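For illustration, a minimal self-contained sketch of that idea, with simplified types and a hypothetical signature (not Mimir's actual ActiveGroups implementation): the limit check and the timestamp update happen under a single lock, and a sentinel error tells the caller to fall back to "other".

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errGroupLimitExceeded is a hypothetical sentinel error; the real API may differ.
var errGroupLimitExceeded = errors.New("active group limit exceeded")

// activeGroups is a simplified stand-in for Mimir's ActiveGroups type.
type activeGroups struct {
	mu                sync.RWMutex
	timestampsPerUser map[string]map[string]int64 // user -> group -> last update (unix nanos)
}

// updateGroupTimestampForUser refreshes the group's timestamp and enforces the
// per-user group limit under the same lock, so the map is only looked up once.
func (ag *activeGroups) updateGroupTimestampForUser(user, group string, now time.Time, limit int) error {
	ag.mu.Lock()
	defer ag.mu.Unlock()

	groups := ag.timestampsPerUser[user]
	if _, known := groups[group]; !known && limit > 0 && len(groups) >= limit {
		// A new group would exceed the limit; the caller can retry with "other".
		return errGroupLimitExceeded
	}
	if groups == nil {
		groups = map[string]int64{}
		ag.timestampsPerUser[user] = groups
	}
	groups[group] = now.UnixNano()
	return nil
}

func main() {
	ag := &activeGroups{timestampsPerUser: map[string]map[string]int64{}}

	group := "payments"
	if err := ag.updateGroupTimestampForUser("tenant-1", group, time.Now(), 1); errors.Is(err, errGroupLimitExceeded) {
		group = "other"
		_ = ag.updateGroupTimestampForUser("tenant-1", group, time.Now(), 0) // 0 means no limit here
	}
	fmt.Println("recorded group:", group)
}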
pkg/util/active_groups.go (outdated)
@@ -91,11 +84,31 @@ func (ag *ActiveGroups) PurgeInactiveGroupsForUser(userID string, deadline int64
	return deletedGroups
}

func (ag *ActiveGroups) PurgeInactiveGroups(inactiveTimeout time.Duration, cleanupFuncs ...func(string, string)) {
	userIDs := make([]string, len(ag.timestampsPerUser))
You can't access ag.timestampsPerUser outside the lock. Please move it after ag.mu.RLock().
Also, to simplify the code, initialise the capacity, not the length, and then just use userIDs = append(userIDs, userID). This removes the need to keep i.
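A short sketch of the suggested pattern, again with simplified types rather than the real Mimir code: the map is only read while the read lock is held, the slice is allocated with a capacity (not a length) of len(...), and append removes the need for an index variable.

package main

import (
	"fmt"
	"sync"
)

// activeGroups is a simplified stand-in for the real type.
type activeGroups struct {
	mu                sync.RWMutex
	timestampsPerUser map[string]map[string]int64
}

// userIDs snapshots the user IDs while holding the read lock.
func (ag *activeGroups) userIDs() []string {
	ag.mu.RLock()
	defer ag.mu.RUnlock()

	// Capacity, not length: append fills the slice without tracking an index.
	userIDs := make([]string, 0, len(ag.timestampsPerUser))
	for userID := range ag.timestampsPerUser {
		userIDs = append(userIDs, userID)
	}
	return userIDs
}

func main() {
	ag := &activeGroups{timestampsPerUser: map[string]map[string]int64{
		"tenant-1": {},
		"tenant-2": {},
	}}
	fmt.Println(len(ag.userIDs()), "users")
}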
pkg/util/active_groups.go (outdated)
	ag.mu.RUnlock()

	for _, userID := range userIDs {
		inactiveGroups := ag.PurgeInactiveGroupsForUser(userID, time.Now().Add(-inactiveTimeout).UnixNano())
Please compute time.Now().Add(-inactiveTimeout).UnixNano() just once, before the for loop, and then reuse it.
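Sketched as a fragment of how that part of the loop could look after the change (illustrative only, using the variable names from the diff above): the deadline is computed once and reused on every iteration.

// Compute the purge deadline once, outside the loop.
deadline := time.Now().Add(-inactiveTimeout).UnixNano()

for _, userID := range userIDs {
	inactiveGroups := ag.PurgeInactiveGroupsForUser(userID, deadline)
	_ = inactiveGroups // ... cleanup of the returned groups stays as in the original ...
}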
Thanks for patiently addressing my feedback! Leaving here a list of unaddressed comments we agreed to work on in a follow-up PR:
- #3439 (comment)
- #3439 (comment)
- #3439 (comment)
- #3439 (comment)
- #3439 (comment)
What this PR does
Gives the ability to specify a config flag, validation.separate-metrics-label, which adds an additional label, group, to cortex_discarded_samples_total, taking its value from a label on an incoming timeseries. This works similarly to HATracker and the cluster label. The default value of this config flag is "", which means that when the flag is not set, the label should get dropped when scraped by Prometheus.

There is a limit to the number of active groups, set by the -max-groups-per-user flag. If a new group is received while the group limit has been reached, the value will be changed to "other". Inactive groups are cleaned up every minute, and a group is considered inactive if it has not been updated in 20 minutes.

The counters that this affects are as follows:
Distributor
Ingester Metrics
Validate
For the above counters, the deletion of these metrics has been updated to use discardedCounter.DeletePartialMatch(userID).
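For illustration, a minimal standalone sketch (not the PR's actual code) of how DeletePartialMatch on a prometheus CounterVec removes every series matching the user label, regardless of what the new group label holds:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical counter mirroring cortex_discarded_samples_total with the new "group" label.
	discardedCounter := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cortex_discarded_samples_total",
		Help: "The total number of samples that were discarded.",
	}, []string{"reason", "user", "group"})

	discardedCounter.WithLabelValues("rate_limited", "tenant-1", "payments").Inc()
	discardedCounter.WithLabelValues("rate_limited", "tenant-1", "checkout").Inc()

	// A partial match on "user" deletes both series above, whatever their "group" value.
	deleted := discardedCounter.DeletePartialMatch(prometheus.Labels{"user": "tenant-1"})
	fmt.Println("series deleted:", deleted) // prints: series deleted: 2
}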
This builds upon the work started by @LeviHarrison (https://github.com/LeviHarrison) in: #2702
Which issue(s) this PR fixes or relates to
Fixes #2420
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]