-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initialize metrics with labels #2162
Conversation
0c08d05
to
5302f27
Compare
// NOTES: | ||
// * Rename process_loader_stats metric (to e.g. process_loader_events_total) and count label (to e.g. event)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to link GH issues here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I think so, just I'm not 100% sure which of these points are worth implementing and which not. Many of these are technically breaking changes, but also many metrics are probably used only by Tetragon developers so 🤷♀️ I need one more pass over the metrics and maybe another pair of eyes, then I'll open issues for improvements.
func (e ErrorType) String() string { | ||
return errorTypeLabelValues[e] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a fallback here for missing types to avoid panics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't panic, in case e
is not in errorTypeLabelValues
it will return an empty string (zero value).
} | ||
|
||
func (t CacheEntryType) String() string { | ||
return cacheEntryTypeLabelValues[t] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as in errormetrics.
} | ||
|
||
func (t MergeErrorType) String() string { | ||
return mergeErrorTypeLabelValues[t] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as errormetrics.
} | ||
|
||
func (s Subsys) String() string { | ||
return subsysLabelValues[s] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as errormetrics.
} | ||
|
||
func (w Watcher) String() string { | ||
return watcherTypeLabelValues[w] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as errormetrics.
LGTM, I just left a few nits/comments. Feel free to take them or leave them. |
f70a3ee
to
cebc21f
Compare
This commit is a preparation for further metrics improvements. There are no exciting changes here, only adding some comments about metrics naming to track further work. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
The label values are now retrieved via ErrorType.String() method. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
This is done so that we can conveniently iterate over all known opcodes. The OpCode.String() method also reads from this map. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Before, error_type label on tetragon_handler_errors_total metric was set to the Go type of the error that occurred. This is suboptimal for a few reasons: * Go type of the error isn't necessarily relevant for the operator. In particular, `errors.errorString` is rather opaque. * Such a label is unstable - developers can simply rename an internal type. * Label values are unbound. This makes it harder to estimate the metric cardinality and initialize it. This commit changes the error_type label to take one of the values defined in the errormetrics package. Technically it is a breaking change. The metrics is marked as "for internal use only", so it should be okay. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Initialize tetragon_errors_total with all possible errors. Initialize tetragon_handler_errors_total with: * known handler opcodes (excluding test) and "event_handler_failed" error * 0 ("unknown") opcode and "unknown_opcode" error Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Introduce CacheEntryType type and constants representing possible values of entry_type label on tetragon_event_cache_retries_total metric. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Instead of using the Go type of the event as event_type label on event cache metrics, use the value defined in the Tetragon API. This is technically a breaking change. However, it makes the labels more stable and consistent across metrics. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Add event_type label to tetragon_event_cache_errors_total, so that errors can be aggregated by event type. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Introduce CacheError type and constants representing possible values of error label on tetragon_event_cache_errors_total metric. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Initialize tetragon_event_cache_retries_total with all possible entry types. Initialize tetragon_event_cache_<entry_type>_errors_total with all possible event types. Initialize tetragon_event_cache_errors_total with all (one) possible errors and event types. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
This is done so that we can conveniently iterate over all common flags. The DecodeCommonFlags() function also reads from this map. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Initialize tetragon_flags_total with all possible flags. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Introduce MergeErrorType type and constants representing possible values of curr_type and prev_type labels on tetragon_generic_kprobe_merge_errors_total metric. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Initialize tetragon_msg_op_total and tetragon_handling_latency with all possible opcodes. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Introduce Subsys type and constants representing possible values of subsys label on tetragon_policyfilter_metrics_total metric. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Introduce Operation type and constants representing possible values of op label on tetragon_policyfilter_metrics_total metric. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Remove error label from tetragon_policyfilter_metrics_total metric. This label wasn't adding any useful information, but its values were unbound, making it risky from a cardinality perspective. This also removes direct dependency on github.com/pkg/errors package. Technically it is a breaking change. The metrics is marked as "for internal use only", so it should be okay. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Initialize tetragon_policyfilter_metrics_total with all possible combinations of subsys and operation. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Metrics defined in the processexecmetrics package are not used anywhere, so let's remove them. They had exec_id as a label - this is risky from a cardinality perspective, so they shouldn't be used in this form anyway. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Introduce Watcher type and constants representing possible values of watcher label on watcher_<errors|events>_total metrics. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Initialize tetragon_watcher_errors_total and tetragon_watcher_events_total metrics with all (one) possible watchers and (also one) errors. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
The label values are now retrieved via DataEventOp.String() method. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Initialize tetragon_data_events_total with all possible events. Initialize tetragon_data_event_size with all possible operations. It's a histogram with quite large number of buckets (20), but only two possible labels values, so it should be okay. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Initialize tetragon_process_loader_stats with all possible event types. Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
cebc21f
to
b24e3f1
Compare
✅ Deploy Preview for tetragon ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
I added a release note to the PR description, covering all changes in metrics & labels. I'm merging this so that #2164 is unblocked. |
So far metrics with labels were registered but not initialized at startup. This leads to unpredictable resources usage and limits certain registry operations, e.g. linting or autogenerating docs.
This PR initializes metrics with known labels. It required explicitly defining known labels, so there are lots of commits doing just that - defining types for label values. Let me know if you want me to squash some of these commits. I also used
InitMetrics
functions as a notebook and left some comments for the future about metrics naming - hope that's ok.A few remarks about the impact and limitations of this approach: