Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Linux memory allocations int64 not uint64 #876

Merged
merged 1 commit into from
Jun 23, 2017

Conversation

justincormack
Copy link
Contributor

The kernel ABI to these values is a string, which accepts the value -1
to mean "unlimited" or an integer up to 2^63 for an amount of memory in
bytes.

While the internal representation in the kernel is unsigned, this is not
exposed in any ABI directly. Because of the user-kernel memory split, values
over 2^63 are not really useful; indeed that much memory is not supported,
as physical memory is limited to 52 bits in the forthcoming switch to five
level page tables. So it is much more natural to support the value -1 for
unlimited, especially as the actual number needed to represent the maximum
has varied in different kernel versions, and across 32 and 64 bit architectures,
so determining the value to use is not possible, so it is necessary to write
the string -1 to the cgroup files.

See also discussion in

Signed-off-by: Justin Cormack justin.cormack@docker.com

@crosbymichael
Copy link
Member

crosbymichael commented Jun 23, 2017

LGTM

Approved with PullApprove

@crosbymichael crosbymichael added this to the 1.0.0 milestone Jun 23, 2017
Copy link
Contributor

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this PR fits into the current v1.0.0 freeze. I'd like to see it landed in 1.0, but I'm generally in favor of polishing vs. pushing 1.0 out quickly.

config-linux.md Outdated
@@ -270,13 +270,16 @@ For more information, see the kernel cgroups documentation about [memory][cgroup
**`memory`** (object, OPTIONAL) represents the cgroup subsystem `memory` and it's used to set limits on the container's memory usage.
For more information, see the kernel cgroups documentation about [memory][cgroup-v1-memory].

Values specify the limit in bytes, or `-1` for unlimited memory. For `swappiness` the values are from `0` (almost never swap) to `100`
(very likely to be swapped).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your newline is in the middle of the second sentence, but should be after the first period to match our policy.

I'd rather have the swappiness semantics documented in the swappiness entry below (or dropped entirely if you feel the current “sysctl's vm.swappiness” punt is sufficient). Then you can use:

For all parameters except swappiness, values specify…

And in the entries themselves, drop the now-redundant “bytes”. For example:

limit (int64, OPTIONAL) - limit memory usage
reservation (int64, OPTIONAL) - soft limit of memory usage

config-linux.md Outdated
* **`kernel`** *(uint64, OPTIONAL)* - sets hard limit for kernel memory
* **`kernelTCP`** *(uint64, OPTIONAL)* - sets hard limit in bytes for kernel TCP buffer memory
* **`limit`** *(int64, OPTIONAL)* - sets limit of memory usage in bytes
* **`reservation`** *(int64, OPTIONAL)* - sets soft limit of memory usage in bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soft limits have some interesting caveats, so I'd be happier with a direct reference to memory.soft_limit_in_bytes. But that's orthogonal to this PR.

@justincormack
Copy link
Contributor Author

@wking adjusted the wording and line breaks as suggested, and fixed the example which had 0 limits rather than unlimited or unspecified.

@wking
Copy link
Contributor

wking commented Jun 23, 2017 via email

@justincormack
Copy link
Contributor Author

@wking amended

config-linux.md Outdated
* **`kernel`** *(int64, OPTIONAL)* - sets hard limit for kernel memory
* **`kernelTCP`** *(int64, OPTIONAL)* - sets hard limit for kernel TCP buffer memory

For `swappiness` the values are from `0` (almost never swap) to `100` (very likely to be swapped).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 0 really “almost never”? From the cgroup docs:

Please note that unlike during the global reclaim, limit reclaim enforces that 0 swappiness really prevents from any swapping even if there is a swap storage available.

which seems stronger than the the sysctl docs:

A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.

And neither one of those mentions a max at 100. I don't doubt that 100 is the current max, but I'd rather punt to kernel docs for the swappiness semantics instead of inlining them with this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworded to paraphrase the comment in the code https://github.com/torvalds/linux/blob/master/mm/vmscan.c#L149

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a user facing ABI, so the limit of 100 will not change without adding a new parameter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworded to paraphrase the comment in the code…

But is exposing kernel comments in the spec a good idea? What if they change the semantics, MUST runtimes translate from the 0–100 range to whatever the local kernel uses?

What's important about this value is that it get written to memory.swappiness. I'd rather leave the config author and host kernel to sort out the memory.swappiness semantics between themselves, and have the runtime (and this spec) just blindly pass the opaque value through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you believe the important thing is what is written to the memory.swappiness we should not be discussing this PR at all, we should change all these values to strings, which is what is actually written to the files.

Percentages, like bytes, are pretty common in human facing interfaces, just like some of the other values are 0 or 1 which we describe as booleans.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you believe the important thing is what is written to the memory.swappiness we should not be discussing this PR at all, we should change all these values to strings, which is what is actually written to the files.

Yeah, and a cgroup config like:

"<controller>": {
    "<filename-suffix>": "<string to write>"
}

For example:

"memory": {
  "swappiness": "60",
  "soft_limit_in_bytes": "-1",
}

would be as expressive as what we have now and a lot easier to document. And it extends to other controllers and properties without needing any spec changes.

The counter-argument is that that approach degenerates into an array-of-syscalls config (e.g. if you set capabilities as integers instead of CAP_* strings. Drawing the line somewhere is useful to keep configs human-readable, but "swappiness": 0 and "swappiness": "0" are equally legible to me.

But overhauling the cgroup-config API is a big change with a 1.0 freeze in place, so in the short term I'd rather just avoid baking more kernel-defined semantics into the spec in situations where the runtime implementation can ignore those semantics.

The kernel ABI to these values is a string, which accepts the value `-1`
to mean "unlimited" or an integer up to 2^63 for an amount of memory in
bytes.

While the internal representation in the kernel is unsigned, this is not
exposed in any ABI directly. Because of the user-kernel memory split, values
over 2^63 are not really useful; indeed that much memory is not supported,
as physical memory is limited to 52 bits in the forthcoming switch to five
level page tables. So it is much more natural to support the value `-1` for
unlimited, especially as the actual number needed to represent the maximum
has varied in different kernel versions, and across 32 and 64 bit architectures,
so determining the value to use is not possible, so it is necessary to write
the string `-1` to the cgroup files.

See also discussion in
- opencontainers/runc#1494
- opencontainers/runc#1492
- opencontainers/runc#1375
- opencontainers/runc#1421

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
@mrunalp
Copy link
Contributor

mrunalp commented Jun 23, 2017

LGTM

Approved with PullApprove

1 similar comment
@tianon
Copy link
Member

tianon commented Jun 23, 2017

LGTM

Approved with PullApprove

@tianon tianon merged commit 198f23f into opencontainers:master Jun 23, 2017
justincormack added a commit to justincormack/runc that referenced this pull request Jun 24, 2017
replace opencontainers#1492 opencontainers#1494
fix opencontainers#1422

Since opencontainers/runtime-spec#876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
@hqhq
Copy link
Contributor

hqhq commented Jun 24, 2017

See my comments in opencontainers/runc#1492 (comment) , the only benefit we can gain from "uint64 -> int64" is that it makes runc code easier, all other reasons are not making int64 better than uint64, just that int64 is acceptable. I don't know if that's kind of compromise we should make in OCI.

But since there's no real world issue for using int64 for now, let's not back and forth on this anymore, not before 1.0, just want us on the same page on this.

LGTM

Approved with PullApprove

justincormack added a commit to justincormack/runc that referenced this pull request Jun 27, 2017
replace opencontainers#1492 opencontainers#1494
fix opencontainers#1422

Since opencontainers/runtime-spec#876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
@vbatts vbatts mentioned this pull request Jul 5, 2017
tiborvass pushed a commit to tiborvass/runc that referenced this pull request Jul 6, 2017
replace opencontainers#1492 opencontainers#1494
fix opencontainers#1422

Since opencontainers/runtime-spec#876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
(cherry picked from commit 3d9074e)
Signed-off-by: Tibor Vass <tibor@docker.com>
tiborvass pushed a commit to tiborvass/runc that referenced this pull request Jul 6, 2017
replace opencontainers#1492 opencontainers#1494
fix opencontainers#1422

Since opencontainers/runtime-spec#876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
(cherry picked from commit 3d9074e)
Signed-off-by: Tibor Vass <tibor@docker.com>
wking added a commit to wking/opencontainer-runtime-spec that referenced this pull request Jul 19, 2017
Catch the JSON Schema up with e73cd70 (Make Linux memory allocations
int64 not uint64, 2017-06-23, opencontainers#876), which carried the kernel's 0 to
100 range into the spec.

Signed-off-by: W. Trevor King <wking@tremily.us>
adrianreber pushed a commit to adrianreber/runc that referenced this pull request Jul 27, 2017
replace opencontainers#1492 opencontainers#1494
fix opencontainers#1422

Since opencontainers/runtime-spec#876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
@justincormack justincormack deleted the memory-int64 branch January 21, 2023 21:03
dims pushed a commit to dims/libcontainer that referenced this pull request Oct 19, 2024
replace #1492 #1494
fix #1422

Since opencontainers/runtime-spec#876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
dims pushed a commit to dims/libcontainer that referenced this pull request Oct 19, 2024
replace #1492 #1494
fix #1422

Since opencontainers/runtime-spec#876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
dims pushed a commit to dims/libcontainer that referenced this pull request Oct 19, 2024
replace #1492 #1494
fix #1422

Since opencontainers/runtime-spec#876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
kolyshkin pushed a commit to kolyshkin/containerd-cgroups that referenced this pull request Nov 6, 2024
replace #1492 #1494
fix #1422

Since opencontainers/runtime-spec#876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants