cache: capture metrics related to cache records and pruning #4476

Draft
jsternberg wants to merge 1 commit into master

Conversation

jsternberg (Collaborator)

Fixes #4401.

@crazy-max (Member)

I guess this replaces #4464?

@jsternberg (Collaborator, Author)

I wasn't aware of that PR, but yes, I think this would likely replace the need for it.

@jsternberg (Collaborator, Author)

@crazy-max I took another look, and it seems that PR is trying to do something different: it pulls the information about a single build out of the progress writer and outputs it as JSON. This PR is about the overall system itself, so it would capture how many times and how long we spend pruning the cache, and it would also show how many cache entries there are.
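As a rough illustration of what that could look like with the OpenTelemetry Go metric API, here is a minimal sketch. The prune instrument names and the wrapper helper are assumptions for the example, not the PR's actual code; only the meter name matches the instrumentation name used in this PR.

```go
package cachemetrics

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// pruneMetrics counts prune runs and records how long each one took.
// The instrument names here are illustrative.
type pruneMetrics struct {
	count    metric.Int64Counter
	duration metric.Float64Histogram
}

func newPruneMetrics() (*pruneMetrics, error) {
	meter := otel.Meter("github.com/moby/buildkit/cache")

	count, err := meter.Int64Counter("cache.prunes.count")
	if err != nil {
		return nil, err
	}
	duration, err := meter.Float64Histogram("cache.prune.duration",
		metric.WithUnit("s"))
	if err != nil {
		return nil, err
	}
	return &pruneMetrics{count: count, duration: duration}, nil
}

// observe wraps a prune call, incrementing the counter and timing the call.
func (m *pruneMetrics) observe(ctx context.Context, prune func(context.Context) error) error {
	start := time.Now()
	err := prune(ctx)
	m.count.Add(ctx, 1)
	m.duration.Record(ctx, time.Since(start).Seconds())
	return err
}
```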

return stats
}

func (cm *cacheManager) collectMetrics(ctx context.Context, o metric.Observer) error {
Member:

When does this get called?

jsternberg (Collaborator, Author):

This gets called automatically during a metrics collection interval (defined by the reader). So for the periodic reader, every 60 seconds by default. For the prometheus reader, whenever /metrics is called.

See the invocation of RegisterCallback earlier in this file and the observable gauge.
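For context, this is roughly what that pattern looks like with the OpenTelemetry Go metric API. It is only a sketch: the helper name and the countRecords accessor are hypothetical, while the meter and gauge names come from this PR.

```go
package cachemetrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// registerCacheRecordsGauge wires an observable gauge to a count function.
// The callback has no schedule of its own: it runs once per collection,
// i.e. whenever a reader collects (periodic interval or a /metrics scrape).
func registerCacheRecordsGauge(countRecords func() int64) (metric.Registration, error) {
	meter := otel.Meter("github.com/moby/buildkit/cache")

	records, err := meter.Int64ObservableGauge("cache.records.count")
	if err != nil {
		return nil, err
	}

	return meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
		o.ObserveInt64(records, countRecords())
		return nil
	}, records)
}
```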

Member:

> So for the periodic reader, every 60 seconds by default. For the prometheus reader, whenever /metrics is called.

q: Does this mean that the calls are duplicated because there are 2 readers?

jsternberg (Collaborator, Author):

Just checked this. Short answer: yes.

The call happens for each reader so it'll happen each time /metrics is hit and also every 60 seconds.

Alternatively, if the metrics are never checked, the call never happens at all: that's the case when the OTel collector isn't configured and the Prometheus endpoint doesn't exist or is never invoked.
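To make the per-reader behavior concrete, here is a minimal sketch of a MeterProvider wired with both a periodic OTLP reader and a Prometheus reader; each reader collects independently, so registered callbacks run once per reader per collection. This assumes OTLP over gRPC and the default Prometheus registry, and it is not BuildKit's actual wiring.

```go
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Periodic reader: pushes over OTLP and triggers a collection every 60s.
	otlpExp, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		panic(err)
	}
	periodic := sdkmetric.NewPeriodicReader(otlpExp,
		sdkmetric.WithInterval(60*time.Second))

	// Prometheus reader: triggers a collection on every /metrics scrape.
	promExp, err := prometheus.New()
	if err != nil {
		panic(err)
	}

	// Both readers are attached to the same provider, so observable
	// callbacks registered on its meters run for each of them.
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(periodic),
		sdkmetric.WithReader(promExp),
	)
	otel.SetMeterProvider(mp)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9090", nil)
}
```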

@tonistiigi (Member) left a comment:

It would help to have a simple guide under docs for a workflow using this data, like we have for tracing: https://docs.docker.com/build/building/opentelemetry/. I guess the /metrics side is somewhat more obvious, but I'm more interested in the OTLP export side (e.g. how to see it in Grafana or similar).

Question: do you think it could make sense to move #3860 to use OTLP exporters as well, and to use a recorder mechanism like we do for traces to capture the metrics for individual containers? Maybe that would give us some built-in logic for combining profiling points or make it easier to visualize. On the other hand, at the moment we have neat typed structures there, while this is a single string key for each value.

docker-bake.hcl Outdated
Comment on lines 140 to 143
//{ name = "labs", tags = "dfrunsecurity dfparents", target = "golangci-lint" },
//{ name = "nydus", tags = "nydus", target = "golangci-lint" },
//{ name = "yaml", tags = "", target = "yamllint" },
//{ name = "proto", tags = "", target = "protolint" },
Member:

Can be reverted since #4490

jsternberg (Collaborator, Author):

Removed. This was for testing and was accidentally committed.

@jsternberg (Collaborator, Author)

@tonistiigi yes, I think updating the docs would be helpful. I've also been considering adapting the local docker compose setup I use for development to be a little more general, just to help facilitate some of these workflows. I was thinking of having the compose file launch buildkit along with configuration for the debugger, jaeger (for tracing), grafana (for viewing metrics), and something to store the metrics output. If there's some interest there, I'll try some things out.

For the build resource metrics, I do think it likely makes sense to convert some of those to OTLP metrics, or to emit equivalent OTLP metrics alongside them. I think it's worth discussing what this might look like, as I'm not really sure what the best approach is.

Signed-off-by: Jonathan A. Sternberg <jonathan.sternberg@docker.com>

const (
instrumentationName = "github.com/moby/buildkit/cache"
metricCacheRecords = "cache.records.count"
Member:

I was wondering if we should have a cache.records.size to collect the cache size, or maybe split it between cache.records.shared.size, cache.records.private.size, and cache.records.reclaimable.size?

jsternberg (Collaborator, Author):

I'm not really sure, to be honest. I do think that will take a bit more effort to implement, so I left it out of this initial version. The reason is that I wasn't quite sure how to determine the disk usage efficiently while taking mutable and immutable records into account. I figured immutable entries wouldn't need to be continuously updated, while mutable ones would have to be rechecked. I also didn't want to hold a lock on the disk usage or perform a potentially expensive computation to count the size.

I do think it's likely worth making a new issue for cache sizes and adding some metrics to it. I'd say all of the above are good metrics.

Member:

> to hold a lock on the disk usage or perform a potentially expensive computation to count the size.

Oh right, that's a very good point; disk usage is indeed a resource-intensive call.
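If size metrics do get added later, one possible shape, sketched here only to illustrate the split discussed above, is a single observable gauge with a record-type attribute rather than separate metric names. The cache.records.size name follows the reviewer's suggestion, and the sizeByType accessor is hypothetical.

```go
package cachemetrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// registerCacheSizeGauge reports cache size split by record type using a
// single instrument with a "type" attribute (e.g. shared, private, reclaimable).
// sizeByType is a hypothetical accessor returning bytes per record type.
func registerCacheSizeGauge(sizeByType func() map[string]int64) (metric.Registration, error) {
	meter := otel.Meter("github.com/moby/buildkit/cache")

	size, err := meter.Int64ObservableGauge("cache.records.size",
		metric.WithUnit("By")) // "By" is the UCUM code for bytes
	if err != nil {
		return nil, err
	}

	return meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
		for typ, bytes := range sizeByType() {
			o.ObserveInt64(size, bytes,
				metric.WithAttributes(attribute.String("type", typ)))
		}
		return nil
	}, size)
}
```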

@jsternberg jsternberg marked this pull request as draft January 4, 2024 21:32
@jsternberg (Collaborator, Author)

Converting this back to a draft for a little bit. I want to iterate on the format of the metric before this gets merged.

Successfully merging this pull request may close these issues: metrics for cache size / prune pauses

4 participants