manager.go:694] Error getting data for container / because of race condition #3407

Closed · sgpinkus opened this issue Oct 7, 2023 · 3 comments · Fixed by #3412

Comments

@sgpinkus

sgpinkus commented Oct 7, 2023

Running cadvisor like this:

# VERSION=v0.47.2 # use the latest release version from https://github.com/google/cadvisor/releases
sudo docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --privileged \
  --device=/dev/kmsg \
  gcr.io/cadvisor/cadvisor:$VERSION \
  --docker_only=true \
  --disable_root_cgroup_stats=true

Gives this error log every 15s:

manager.go:694] Error getting data for container / because of race condition

Setting --disable_root_cgroup_stats=false makes this error log go away.

@hhromic
Contributor

hhromic commented Oct 10, 2023

This happens to us as well when we disable root cgroup stats.

That log message actually appears when something scrapes the /metrics endpoint of cadvisor.
If you do curl -s http://localhost:8080/metrics, you will notice it appears with every curl call.

The 15s interval you are seeing is likely your Prometheus server scraping every 15s (the default)?

There was recently a PR fixing related log spam: #3341 (not yet released).
Perhaps the same kind of fix can be applied to this error here:

klog.Warningf("Error getting data for container %s because of race condition", name)

The error message itself is a bit misleading, as this is not really a race condition.
The / container should not be added to the entities to collect data for when root cgroup stats are disabled.
But I have not dug deeper into where that is done to propose a proper fix.
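
For illustration, a minimal sketch of what that kind of demotion could look like, modeled on the approach in #3341 and assuming cadvisor keeps using klog here (a sketch only, not the actual patch):

// Before: logged as a warning on every affected scrape of /metrics.
klog.Warningf("Error getting data for container %s because of race condition", name)

// After (sketch): only emitted when cadvisor runs with verbosity 4 or higher
// (klog's -v flag), so default logs stay quiet.
klog.V(4).Infof("Error getting data for container %s because of race condition", name)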

@sgpinkus
Author

The 15s interval you are seeing is likely your Prometheus server scraping every 15s (the default)?

Yes, that would be it!

@hhromic
Contributor

hhromic commented Oct 10, 2023

The error message itself is a bit misleading, as this is not really a race condition.
The / container should not be added to the entities to collect data for when root cgroup stats are disabled.
But I have not dug deeper into where that is done to propose a proper fix.

I got some time now and did dig deeper.
It turns out that container metrics are collected recursively, starting from /, here:

containers, err := c.infoProvider.GetRequestedContainersInfo("/", c.opts)

Therefore it is normal to hit / during collection even when root cgroup stats are disabled.
That being said, I think that refactoring the error logging to V(4), as done in #3341, is indeed an appropriate solution.
I will open a PR for it now :)
