Fix memory stats for cgroup v2 #2810

Merged
merged 1 commit into google:master on Mar 2, 2021

Conversation

@odinuge (Contributor) commented Feb 22, 2021

In cgroup v2, MemoryStats.UseHierarchy has no meaning, since all
resources are tracked as if UseHierarchy=true in v1. cgroup v2 also uses
different names for some of the stats.

For those interested, the mapping can be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memcontrol.c#n4026

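For illustration, here is a minimal sketch (not the patch itself) of the kind of key translation the kernel mapping in mm/memcontrol.c implies when parsing a v2 memory.stat file; the exact table used in the PR may differ, and the mount point /sys/fs/cgroup is an assumption:

```go
// Sketch: read a cgroup v2 memory.stat and report a few values under their
// v1 names. In v2 every value is already hierarchical, which is why
// MemoryStats.UseHierarchy carries no meaning there.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// A few v2 -> v1 renames; the full table in the patch may differ.
var v2ToV1 = map[string]string{
	"file":        "cache",       // page cache
	"anon":        "rss",         // anonymous memory
	"file_mapped": "mapped_file", // mmapped file pages
}

func readV2MemoryStat(cgroupDir string) (map[string]uint64, error) {
	data, err := os.ReadFile(cgroupDir + "/memory.stat")
	if err != nil {
		return nil, err
	}
	stats := make(map[string]uint64)
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		val, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		name := fields[0]
		if v1, ok := v2ToV1[name]; ok {
			name = v1 // report under the v1 name consumers expect
		}
		stats[name] = val
	}
	return stats, nil
}

func main() {
	stats, err := readV2MemoryStat("/sys/fs/cgroup") // assumes cgroup2 mounted here
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(stats)
}
```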
google-cla bot added the cla: yes label on Feb 22, 2021
@k8s-ci-robot (Collaborator) commented

Hi @odinuge. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@iwankgb (Collaborator) commented Feb 22, 2021

/ok-to-test

@odinuge (Contributor, Author) commented Feb 25, 2021

/cc @harche

ref: kubernetes/kubernetes#99230

@harche (Contributor) commented Feb 25, 2021

Wow, looks great. Thanks @odinuge for this PR. I will test this with node e2e and report back.

@odinuge (Contributor, Author) commented Feb 25, 2021

> Wow, looks great. Thanks @odinuge for this PR. I will test this with node e2e and report back.

I have already tested it. Together with the other changes to cadvisor, the results can be found here: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/99269/pull-kubernetes-node-crio-cgrpv2-e2e/1363504000409800704/. Not sure what the Pagefault error is, but I don't think that is a problem with cgroup v2, but rather with the tests. The root memory is due to missing memory.current, and the other memory issues are because the tests run with crun instead of runc.
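The root-memory gap is easy to reproduce: on cgroup v2 the root cgroup does not expose memory.current at all. A quick probe, assuming cgroup2 is mounted at /sys/fs/cgroup:

```go
// Probe which memory interface files exist at the v2 root cgroup.
// memory.current only exists on non-root cgroups, so root usage has
// no direct source file.
package main

import (
	"fmt"
	"os"
)

func main() {
	for _, name := range []string{"memory.current", "memory.stat"} {
		_, err := os.Stat("/sys/fs/cgroup/" + name)
		fmt.Printf("%-16s present: %v\n", name, err == nil)
	}
}
```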

@odinuge (Contributor, Author) commented Feb 25, 2021

Ah, the MajorPageFaults error is on SystemContainers[pods]; that makes sense given the different accounting between cgroup v1 and v2.

@harche (Contributor) commented Feb 25, 2021

> the other memory issues are because the tests run with crun instead of runc.

/cc @giuseppe

@giuseppe (Contributor) commented

> The root memory is due to missing memory.current, and the other memory issues are because the tests run with crun instead of runc.

Thanks, what test exactly is failing because of crun? I can help and take a look. Do you know if it is caused by crun creating a subcgroup when systemd is used, or is there any other reason?

@odinuge (Contributor, Author) commented Feb 25, 2021

> Thanks, what test exactly is failing because of crun? I can help and take a look. Do you know if it is caused by crun creating a subcgroup when systemd is used, or is there any other reason?

Yes, it is due to the run.oci.systemd.subgroup default value. Because crun on cgroup v2 writes the memory limit (memory.max) to containers-scope.scope/container, cadvisor reads max for all containers. Since k8s uses that value mapped directly from cgroup->pod, the tests fail and the metrics are wrong.

For runc this "works" as expected, but since the device cgroup in runc is kinda broken for v2, runc doesn't work at all with Kubernetes on cgroup v2... :/
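To make the failure mode concrete, here is a hypothetical sketch (the scope path is invented for illustration) of what a reader of the systemd scope sees when crun has placed the real limit in a "container" subgroup one level down:

```go
// Illustration of the run.oci.systemd.subgroup behavior described above:
// reading memory.max at the scope yields "max" (no limit), while the
// actual limit crun wrote lives in the "container" subgroup below it.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func memoryMax(dir string) string {
	b, err := os.ReadFile(filepath.Join(dir, "memory.max"))
	if err != nil {
		return "<missing>"
	}
	return strings.TrimSpace(string(b))
}

func main() {
	scope := "/sys/fs/cgroup/kubepods.slice/crio-abc123.scope" // hypothetical path
	fmt.Println("scope:     ", memoryMax(scope))               // cadvisor reads here -> "max"
	fmt.Println("container: ", memoryMax(filepath.Join(scope, "container"))) // real limit here
}
```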

@giuseppe (Contributor) commented

Could we read the cgroup from /proc/$CONTAINER_PID/cgroup? In general there should not be any assumption about what the final systemd cgroup will be (point 5 of the DON'TS: https://systemd.io/CGROUP_DELEGATION/), although I've seen this assumption made everywhere :/
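For reference, a minimal sketch of what this suggestion would look like, assuming a unified (v2) hierarchy mounted at /sys/fs/cgroup; on v2 the file holds a single "0::<path>" line:

```go
// Resolve a process's actual cgroup path from /proc/<pid>/cgroup instead
// of assuming the systemd naming scheme.
package main

import (
	"fmt"
	"os"
	"strings"
)

func cgroupPathOf(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		// Format: hierarchy-ID:controller-list:cgroup-path; v2 entries use "0::".
		if parts := strings.SplitN(line, ":", 3); len(parts) == 3 && parts[0] == "0" {
			return "/sys/fs/cgroup" + parts[2], nil
		}
	}
	return "", fmt.Errorf("no cgroup v2 entry for pid %d", pid)
}

func main() {
	path, err := cgroupPathOf(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(path)
}
```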

@odinuge (Contributor, Author) commented Feb 25, 2021

> Could we read the cgroup from /proc/$CONTAINER_PID/cgroup?

Hmm. This code (as a big simplification) loops through the cgroup hierarchy, and k8s selects the data it wants by using the cgroup of the container. I don't think k8s should deal with container PIDs; that should be the job of the CRI runtime, afaik.

Not sure what the best solution is, other than using stats from CRI instead (I have not checked whether those work similarly on cgroup v2 with crun either)...

> In general there should not be any assumption about what the final systemd cgroup will be (point 5 of the DON'TS: https://systemd.io/CGROUP_DELEGATION/), although I've seen this assumption made everywhere :/

Yup. The systemd implementation in k8s/cadvisor/libcontainer is fragile and depends highly on the naming scheme... The runtime spec is kinda vague as well, and as I read it, it doesn't give that guarantee either...

When it comes to the systemd docs, k8s and runc break most of those "don'ts", I guess 😅

edit: But this discussion is probably outside the scope of this PR...

bobbypage self-assigned this on Mar 2, 2021
@bobbypage (Collaborator) commented

LGTM, thanks!

bobbypage merged commit 5c551bc into google:master on Mar 2, 2021