Disk usage metrics for containerd #2785

Open
ribbybibby opened this issue Jan 11, 2021 · 36 comments

Comments

@ribbybibby

When switching from docker to containerd as my container runtime in Kubernetes, I noticed that container_fs_usage_bytes metrics were no longer being exported for my containers.

It looks like disk usage metrics aren't implemented for containerd, as noted by this comment: https://github.com/google/cadvisor/blob/v0.38.6/container/containerd/handler.go#L164-L165.

Disk usage is a pretty important metric to monitor, so I think, if possible, this should be added.

bobheadxi added a commit to sourcegraph/sourcegraph-public-snapshot that referenced this issue Jan 13, 2021
Remove container fs inodes: disk metrics are not supported in OCI it seems (google/cadvisor#2785), and the metrics it reports in docker-compose feels rather dubious at times. Instead, make ContainerIOUsage a shared observable, and the services that had relevant uses for the inodes monitoring now have this instead.

Reworked container restart: use cAdvisor metrics to detect container restarts in all environments

cAdvisor and monitoring documentation: inline documentation improvements and a new cAdvisor page in the docsite

Shared Group titles: titles are now in `shared` package for consistency and ease of editing
@elcomtik

elcomtik commented Feb 1, 2021

I have experienced the same issue.

> It looks like disk usage metrics aren't implemented for containerd, as noted by this comment: https://github.com/google/cadvisor/blob/v0.38.6/container/containerd/handler.go#L164-L165.

There is some discussion of these metrics in containerd/containerd#678. I suppose containerd can provide this information.

@yyrdl

yyrdl commented May 19, 2021

A PR was submitted: #2872

@jepio
Contributor

jepio commented Jan 24, 2022

PR #2872 was closed, in favor of #2956, which was merged and subsequently reverted in #2964. The result is that these metrics are not available.

Is someone working on an alternative approach?

@sarbajitc

Is there any timeline for a fix of this issue?

@baasumo

baasumo commented May 2, 2022

+1 on looking for any update or timeline regarding this issue - these metrics are pretty important for observability and workload behavior.

@fernandesnikhil

Adding onto this ticket since we're blocked on switching to the containerd CRI without these metrics. We have alerting around ephemeral file system usage that would break if cAdvisor doesn't collect these from containerd.

@snuggie12

@bobbypage Do you have an update on this? Best I can follow is that there is a possibly-working version in the containerd-cri branch after #2966 was merged. However, it might be incomplete based on #2936 (comment)?

Alternatively it seems like work has gone into not using cadvisor for container stats and k8s 1.23 has an alpha feature-gate which uses the cri stats provider (PodAndContainerStatsFromCRI). Is the plan to put momentum into that instead? If so, do you know when it would go beta?

@brandond

brandond commented May 25, 2022

Enabling the PodAndContainerStatsFromCRI feature-gate does not seem to work either; at least with containerd 1.6.4 the stats are still missing.

@brandond

brandond commented Aug 4, 2022

It appears that this won't be addressed any time soon, as KEP-2371 moves most of the stats collection out of cadvisor into the CRI interface. Is there an interim solution for users that need these stats?

@bobbypage
Collaborator

bobbypage commented Aug 4, 2022

The workaround for now is to use the containerd-cri branch (https://github.com/google/cadvisor/tree/containerd-cri) which has a special patch to export containerd disk metrics. The following image can be used: gcr.io/cadvisor/cadvisor:v0.45.0-containerd-cri which is built from that branch and contains the patch.

@brandond

brandond commented Aug 4, 2022

Is that branch being actively maintained? Do you know if it still works normally with other runtimes?

@bobbypage
Collaborator

bobbypage commented Aug 4, 2022

> Is that branch being actively maintained?

Yes, it is maintained; we just pushed the latest v0.45.0 changes to this branch. We need this separate branch because getting the disk usage metrics on containerd requires importing the CRI API into cAdvisor. However, we can't import the CRI API into cAdvisor, because cAdvisor is imported by k8s and k8s itself includes the CRI API, which results in a circular dependency. So the workaround for now is to have this separate branch, which includes the CRI API (see #2872 (comment) for that discussion).

> Do you know if it still works normally with other runtimes?

Yes, it will work with other runtimes as well, but if containerd is not used there is no benefit to using it.

@brandond

brandond commented Aug 4, 2022

Ah hmm. So I take it the circular dependency prohibits this branch from being embedded in the kubelet, and there's no easy path towards doing so? Running a standalone deployment of cadvisor isn't particularly palatable, as asking our users to retool their monitoring stacks to make use of that would be a non-trivial amount of work. I'm honestly surprised that we got this far into the dockershim deprecation with cadvisor missing feature parity for one of the most popular replacement runtimes.

@bobbypage
Collaborator

@brandond are you referring to the fact that most folks are using the existing /metrics/cadvisor endpoint on the kubelet? If that's the case, then yes, unfortunately we aren't able to bring this patch back into the kubelet due to the circular dependency issue. The KEP https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2371-cri-pod-container-stats/README.md aims to solve this issue long term.

@snuggie12

Would it be possible to list a config somewhere that only gathers the disk metrics? GKE controls our normal cadvisor, so running a minimal "container disk metrics only" daemonset seems like a simple workaround.

@yvespp

yvespp commented Oct 10, 2022

kubectl get --raw "/api/v1/nodes/(node-name)/proxy/stats/summary" (served by the kubelet) gives ephemeral-storage info for each pod, but sadly it's not available as a Prometheus metric...
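
To make the gap concrete, here is a minimal Go sketch of turning that summary JSON into Prometheus text exposition lines. The JSON field names follow the kubelet Summary API (`podRef`, `containers[].rootfs.usedBytes`, `ephemeral-storage.usedBytes`). The metric names (`example_*`), the helper `renderMetrics`, and the sample payload are assumptions made up for illustration, not names exported by cAdvisor or the kubelet.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// fsStats holds the one filesystem field of the Summary API this sketch reads.
type fsStats struct {
	UsedBytes *uint64 `json:"usedBytes"`
}

type containerStats struct {
	Name   string   `json:"name"`
	Rootfs *fsStats `json:"rootfs"` // usage of the container's writable layer
}

type podStats struct {
	PodRef struct {
		Name      string `json:"name"`
		Namespace string `json:"namespace"`
	} `json:"podRef"`
	Containers       []containerStats `json:"containers"`
	EphemeralStorage *fsStats         `json:"ephemeral-storage"`
}

type summary struct {
	Pods []podStats `json:"pods"`
}

// renderMetrics converts a Summary API JSON document into Prometheus text
// exposition lines. The metric names are hypothetical. %q is used for label
// values, which is adequate for Kubernetes object names.
func renderMetrics(raw []byte) (string, error) {
	var s summary
	if err := json.Unmarshal(raw, &s); err != nil {
		return "", fmt.Errorf("decoding summary: %w", err)
	}
	var b strings.Builder
	for _, p := range s.Pods {
		if p.EphemeralStorage != nil && p.EphemeralStorage.UsedBytes != nil {
			fmt.Fprintf(&b, "example_pod_ephemeral_storage_used_bytes{namespace=%q,pod=%q} %d\n",
				p.PodRef.Namespace, p.PodRef.Name, *p.EphemeralStorage.UsedBytes)
		}
		for _, c := range p.Containers {
			if c.Rootfs != nil && c.Rootfs.UsedBytes != nil {
				fmt.Fprintf(&b, "example_container_rootfs_used_bytes{namespace=%q,pod=%q,container=%q} %d\n",
					p.PodRef.Namespace, p.PodRef.Name, c.Name, *c.Rootfs.UsedBytes)
			}
		}
	}
	return b.String(), nil
}

func main() {
	// Made-up sample payload; in practice this would be the body returned by
	// the stats/summary endpoint.
	sample := []byte(`{"pods":[{"podRef":{"name":"web-0","namespace":"default"},
		"containers":[{"name":"app","rootfs":{"usedBytes":102400}}],
		"ephemeral-storage":{"usedBytes":204800}}]}`)
	out, err := renderMetrics(sample)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

A real exporter would fetch the summary from each node on every scrape and serve the result over HTTP; the sketch only shows the mapping step.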

@george-angel
Contributor

george-angel commented Feb 20, 2023

Refreshing my memory on this issue, I realised we didn't link to the exporter @ribbybibby wrote to address this: https://github.com/utilitywarehouse/kube-summary-exporter. We have been running it for nearly 2 years now.

@markrity

markrity commented May 4, 2023

Any updates on this? containerd is the default and recommended runtime for GKE, but there is still no support for kubernetes_filesystem_usage?

@sidewinder12s

sidewinder12s commented May 24, 2023

It appears that, at least on containerd://1.6.6 and the v0.45.0-containerd-cri tag, the `container_fs_*` metrics are also just wrong.

container_fs_usage_bytes at least seems to be reporting the root device free space for every pod on the node, as opposed to each container's/pod's usage. Does anyone have a reference deployment manifest to use for containerd + that containerd-cri tag?

acumino added a commit to acumino/gardener that referenced this issue Aug 4, 2023
acumino added a commit to acumino/gardener that referenced this issue Aug 4, 2023
acumino added a commit to acumino/gardener that referenced this issue Aug 8, 2023
acumino added a commit to acumino/gardener that referenced this issue Aug 9, 2023
@dragosrosculete

Any hope for this to be implemented soon?

gardener-prow bot pushed a commit to gardener/gardener that referenced this issue Aug 10, 2023
* Add dashboards

* Introduce new value `IsGardenCluster`

* Add dashboard providers configmap

* Add datasource configMap

* Add service

* Add dashboard configMaps

* Add deployment

* Add ingress

* Move helper function at the end

* Deploy oidc dashboard only if authentication webhook is enabled

* Integrate plutono in gop flow

* Adapt seed plutono

* Adapt shoot plutono

* Integrate vali

* Adapt test

* Adapt integration and e2e test

* --------------Empty separator commit---------------

* Reuse  dashboard among shoot and garden

* Change datasource name from `cluster-prometheus` to `prometheus`

Update plutono.go

* Adapt apiserver-overview dashboard to make it reusable.

Rename dashboard variable "apiserver" to "pod"

Add 2 variables: job and pod

Add the pod variable to the promql expressions

* Reuse `apiserver overview` dashboard

* Reuse `apiserver` related other dashboard

* make default selection all

* Reuse apiserver-request-duration-and-response dashboard

Old shoot dashboard had some random stuff also

* Add pod logs to kubernetes pods dashbboard

* Remove Pod file system usage metrics

ref - google/cadvisor#2785

* Adapt PC doc

* Address review

* Use same port for all use case

* Drop special handling for OIDC webhook

* Allow garden dashboard to have additional dashboards

* Adapt test

* Use wildcard cert for ingress in runtime cluster

* Address review

* Address review

* Update docs/usage/trusted-tls-for-garden-runtime.md

* Update docs/README.md

---------

Co-authored-by: Tim Usner <tim.usner@sap.com>
briantopping pushed a commit to briantopping/gardener that referenced this issue Aug 22, 2023
nickytd pushed a commit to nickytd/gardener that referenced this issue Sep 11, 2023
@smileusd

Any updates?

@mikkeloscar
Contributor

I tried to rebase the containerd-cri branch (https://github.com/google/cadvisor/tree/containerd-cri) on v0.48.0 (and v0.47.1); in both cases the resource usage blows up: 🙁

image

I do see values for the metrics, but didn't validate that they are correct; other comments report that they are not.

@brandond

brandond commented Nov 28, 2023

I doubt this is going to be fixed, given the work in progress to move stats into the CRI API, and use the CRI stats to replace the data currently served at the cadvisor metrics endpoint - as discussed above.

https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2371-cri-pod-container-stats/README.md#cadvisor-less-cri-full-container-and-pod-stats

  • enhance the CRI API with enough metrics to be able to supplement the pod and container fields in the summary API directly from CRI.
  • enhance the CRI implementations to broadcast the required metrics to fulfill the pod and container fields in the /metrics/cadvisor endpoint.

It looks like containerd's cgroupv2 manager does not currently support filesystem utilization stats; it only returns data for PIDs, CPU, memory, block IO, RDMA, and HugeTLB.
https://github.com/containerd/cgroups/blob/v3.0.2/cgroup2/stats/metrics.pb.go#L28-L34

@gravese

gravese commented Jan 25, 2024

Are there any updates?

@ning1875

The crictl tool can report container fs usage, e.g.:

 crictl stats
CONTAINER           CPU %               MEM                 DISK                INODES
0674440a33dbd       0.00                1.438MB             102.4kB             24
2e2f101e7ce72       0.06                62.43MB             114.7kB             29
37ed67b1e33cf       1.58                346.1MB             110.6MB             41

The data shown in the DISK column comes from this code: crictl calls the CRI ListContainerStats API over gRPC, and the response has a field named WritableLayer, which holds the container's fs usage.

type ContainerStats struct {
	// Information of the container.
	Attributes *ContainerAttributes `protobuf:"bytes,1,opt,name=attributes,proto3" json:"attributes,omitempty"`
	// CPU usage gathered from the container.
	Cpu *CpuUsage `protobuf:"bytes,2,opt,name=cpu,proto3" json:"cpu,omitempty"`
	// Memory usage gathered from the container.
	Memory *MemoryUsage `protobuf:"bytes,3,opt,name=memory,proto3" json:"memory,omitempty"`
	// Usage of the writable layer.
	WritableLayer *FilesystemUsage `protobuf:"bytes,4,opt,name=writable_layer,json=writableLayer,proto3" json:"writable_layer,omitempty"`
	// Swap usage gathered from the container.
	Swap                 *SwapUsage `protobuf:"bytes,5,opt,name=swap,proto3" json:"swap,omitempty"`
	XXX_NoUnkeyedLiteral struct{}   `json:"-"`
	XXX_sizecache        int32      `json:"-"`
}

So containerd is able to report container fs usage; why doesn't cAdvisor call this ListContainerStats API?
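
As a rough illustration of reading that field, the Go sketch below pulls the writable-layer usage out of a stats entry while guarding against missing values. To stay self-contained it uses local stand-in types that mirror only a few fields of the struct quoted above; real code would presumably import the CRI API package and obtain the stats from a gRPC client. The helper `writableLayerBytes` and the sample values are made up for illustration.

```go
package main

import "fmt"

// Local stand-ins that mirror only the fields of the CRI types used here.
// They are assumptions for this sketch, not the generated CRI API types.
type UInt64Value struct{ Value uint64 }

type FilesystemUsage struct {
	UsedBytes  *UInt64Value
	InodesUsed *UInt64Value
}

type ContainerAttributes struct{ Id string }

type ContainerStats struct {
	Attributes    *ContainerAttributes
	WritableLayer *FilesystemUsage
}

// writableLayerBytes returns the writable-layer usage in bytes and whether the
// runtime reported it. Every level is a pointer, so each one is nil-checked.
func writableLayerBytes(s *ContainerStats) (uint64, bool) {
	if s == nil || s.WritableLayer == nil || s.WritableLayer.UsedBytes == nil {
		return 0, false
	}
	return s.WritableLayer.UsedBytes.Value, true
}

func main() {
	// Made-up sample entries: one with disk stats, one without.
	stats := []*ContainerStats{
		{
			Attributes:    &ContainerAttributes{Id: "container-a"},
			WritableLayer: &FilesystemUsage{UsedBytes: &UInt64Value{Value: 102400}},
		},
		{Attributes: &ContainerAttributes{Id: "container-b"}},
	}
	for _, s := range stats {
		if used, ok := writableLayerBytes(s); ok {
			fmt.Printf("%s writable layer: %d bytes\n", s.Attributes.Id, used)
		} else {
			fmt.Printf("%s writable layer: not reported\n", s.Attributes.Id)
		}
	}
}
```

Returning a separate `ok` flag keeps "not reported" distinct from a genuine zero-byte writable layer, which matters if the value is later exported as a metric.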

@brandond

brandond commented Feb 21, 2024

I believe that is gated on the PodAndContainerStatsFromCRI FeatureGate, which is still alpha?

Have you tried enabling it on your node?

@george-angel
Contributor

It looks like that breaks other things: kubernetes/kubernetes#111276

@wolgod

wolgod commented Apr 29, 2024

+1 Is there any solution now?

@changhyuni

What happened to the usage metric?
Where should I check?

@robini

robini commented Jul 15, 2024

Is there any solution for getting disk usage metrics for containerd via Prometheus?

@mindw

mindw commented Aug 27, 2024

@robini - quoting @george-angel:

> Refreshing my memory on this issue, I realised we didn't link to the exporter @ribbybibby wrote to address this: https://github.com/utilitywarehouse/kube-summary-exporter. We have been running it for nearly 2 years now.

@sidewinder12s

Yep also been running that for about a year.

@luarx

luarx commented Sep 18, 2024

Is there any timeline for a fix for this issue? 🙏
We need to monitor the storage used by each container (container_fs_usage_bytes), not only ephemeral storage, because we are not always using ephemeral storage.

@vishiy

vishiy commented Oct 30, 2024

Hi all - when can we expect a fix for this?
