backports/v1.0: Add a metric to provide per-event missed events #1702
[upstream commit d5a7ee2]
Example:
$ curl localhost:2112/metrics 2> /dev/null | grep -E 'sent_events_total|missed_events_total|ringbuf_perf_event_lost_total|ringbuf_queue_lost_total|msg_op_total|ringbuf_queue_received_total'
tetragon_missed_events_total{msg_op="13"} 73300
tetragon_missed_events_total{msg_op="23"} 28
tetragon_missed_events_total{msg_op="24"} 606
tetragon_missed_events_total{msg_op="5"} 20
tetragon_missed_events_total{msg_op="7"} 22
tetragon_msg_op_total{msg_op="13"} 4.268532e+06
tetragon_msg_op_total{msg_op="23"} 12444
tetragon_msg_op_total{msg_op="24"} 2110
tetragon_msg_op_total{msg_op="5"} 11908
tetragon_msg_op_total{msg_op="7"} 12447
tetragon_ringbuf_perf_event_lost_total 73976
tetragon_ringbuf_queue_lost_total 0
tetragon_ringbuf_queue_received_total 4.307441e+06
This PR adds an eBPF map collector that reads metrics directly from a map. The map records the return value of every perf_event_output call, i.e. whether the call failed. This lets us determine the number of missed events per event type. The tetragon_missed_events_total metric exposes this information.
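As a rough illustration of the idea (not the code in this PR), the sketch below models the map as a Python dictionary keyed by msg_op. The names `missed_map`, `record_output`, and `collect` are hypothetical, and the real map layout may differ. It only shows the flow: the kernel side counts failed output calls per message type, and the collector renders one sample per map entry at scrape time.

```python
from collections import Counter

# Hypothetical stand-in for the eBPF map (assumed layout):
# key = msg_op, value = number of failed perf_event_output calls.
missed_map = Counter()


def record_output(msg_op: int, ret: int) -> None:
    """Model of the kernel side: count a miss when the output call fails."""
    if ret < 0:
        missed_map[msg_op] += 1


def collect(metric: str = "tetragon_missed_events_total") -> list[str]:
    """Model of the collector: one sample line per map entry, read on scrape."""
    return [
        f'{metric}{{msg_op="{op}"}} {count}'
        for op, count in sorted(missed_map.items())
    ]


# Simulate three output calls: one success and two failures.
record_output(13, 0)
record_output(13, -2)
record_output(5, -2)
print("\n".join(collect()))
```

Because the counts live in the map, the collector does no bookkeeping of its own between scrapes; it reports whatever the map holds at that moment.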
Using the previous example, we can see that user space lost 73976 events (tetragon_ringbuf_perf_event_lost_total). This matches the sum of all tetragon_missed_events_total values gathered from the kernel: 73300 + 28 + 606 + 20 + 22 = 73976.
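The comparison can be checked mechanically. The following is a minimal sketch that parses the relevant lines from the example output and compares the two numbers. The helper `parse_samples` is hypothetical and handles only the simple `name{labels} value` lines shown here, not the full Prometheus exposition format.

```python
import re

# Relevant lines copied from the example output above.
SAMPLE = """\
tetragon_missed_events_total{msg_op="13"} 73300
tetragon_missed_events_total{msg_op="23"} 28
tetragon_missed_events_total{msg_op="24"} 606
tetragon_missed_events_total{msg_op="5"} 20
tetragon_missed_events_total{msg_op="7"} 22
tetragon_ringbuf_perf_event_lost_total 73976
"""

LINE_RE = re.compile(r"^(\w+)(\{[^}]*\})?\s+(\S+)$")


def parse_samples(text):
    """Return (name, labels, value) tuples for simple metric lines."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples


samples = parse_samples(SAMPLE)

# Sum the per-type misses reported from the kernel side.
missed_sum = sum(
    value for name, _, value in samples
    if name == "tetragon_missed_events_total"
)

# Losses as seen from user space.
lost = next(
    value for name, _, value in samples
    if name == "tetragon_ringbuf_perf_event_lost_total"
)

print(int(missed_sum), int(lost), missed_sum == lost)
# prints: 73976 73976 True
```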