From 56b27be76673530cdb49c4e7c6485885cdbbffa9 Mon Sep 17 00:00:00 2001 From: Derek Carr Date: Sun, 23 Jul 2017 21:52:23 -0400 Subject: [PATCH] HugePages proposal --- hugepages.md | 308 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 308 insertions(+) create mode 100644 hugepages.md diff --git a/hugepages.md b/hugepages.md new file mode 100644 index 00000000..27e5c5af --- /dev/null +++ b/hugepages.md @@ -0,0 +1,308 @@ +# HugePages support in Kubernetes + +**Authors** +* Derek Carr (@derekwaynecarr) +* Seth Jennings (@sjenning) +* Piotr Prokop (@PiotrProkop) + +**Status**: In progress + +## Abstract + +A proposal to enable applications running in a Kubernetes cluster to use huge +pages. + +A pod may request a number of huge pages. The `scheduler` is able to place the +pod on a node that can satisfy that request. The `kubelet` advertises an +allocatable number of huge pages to support scheduling decisions. A pod may +consume hugepages via `hugetlbfs` or `shmget`. Huge pages are not +overcommitted. + +## Motivation + +Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi +of memory is equal to 256 pages; 1Gi of memory is 256,000 pages, etc. CPUs have +a built-in memory management unit that manages a list of these pages in +hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of +virtual-to-physical page mappings. If the virtual address passed in a hardware +instruction can be found in the TLB, the mapping can be determined quickly. If +not, a TLB miss occurs, and the system falls back to slower, software based +address translation. This results in performance issues. Since the size of the +TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the +page size. + +A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, +there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other +architectures, but the idea is the same. In order to use huge pages, +application must write code that is aware of them. Transparent huge pages (THP) +attempts to automate the management of huge pages without application knowledge, +but they have limitations. In particular, they are limited to 2Mi page sizes. +THP might lead to performance degradation on nodes with high memory utilization +or fragmentation due to defragmenting efforts of THP, which can lock memory +pages. For this reason, some applications may be designed to (or recommend) +usage of pre-allocated huge pages instead of THP. + +Managing memory is hard, and unfortunately, there is no one-size fits all +solution for all applications. + +## Scope + +This proposal only includes pre-allocated huge pages configured on the node by +the administrator at boot time or by manual dynamic allocation. It does not +discuss how the cluster could dynamically attempt to allocate huge pages in an +attempt to find a fit for a pod pending scheduling. It is anticipated that +operators may use a variety of strategies to allocate huge pages, but we do not +anticipate the kubelet itself doing the allocation. Allocation of huge pages +ideally happens soon after boot time. + +This proposal defers issues relating to NUMA. + +## Use Cases + +The class of applications that benefit from huge pages typically have +- A large memory working set +- A sensitivity to memory access latency + +Example applications include: +- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.) +- Java applications can back the heap with huge pages using the + `-XX:+UseLargePages` and `-XX:LagePageSizeInBytes` options. +- packet processing systems (DPDK) + +Applications can generally use huge pages by calling +- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and use it as anonymous memory +- `mmap()` a file backed by `hugetlbfs` +- `shmget()` with `SHM_HUGETLB` and use it as a shared memory segment (see Known + Issues). + +1. A pod can use huge pages with any of the prior described methods. +1. A pod can request huge pages. +1. A scheduler can bind pods to nodes that have available huge pages. +1. A quota may limit usage of huge pages. +1. A limit range may constrain min and max huge page requests. + +## Feature Gate + +The proposal introduces huge pages as an Alpha feature. + +It must be enabled via the `--feature-gates=HugePages=true` flag on pertinent +components pending graduation to Beta. + +## Node Specfication + +Huge pages cannot be overcommitted on a node. + +A system may support multiple huge page sizes. It is assumed that most nodes +will be configured to primarily use the default huge page size as returned via +`grep Hugepagesize /proc/meminfo`. This defaults to 2Mi on most Linux systems +unless overriden by `default_hugepagesz=1g` in kernel boot parameters. + +For each supported huge page size, the node will advertise a resource of the +form `hugepages-`. On Linux, supported huge page sizes are +determined by parsing the `/sys/kernel/mm/hugepages/hugepages-{size}kB` +directory on the host. Kubernetes will expose a `hugepages-` +resource using binary notation form. It will convert `` into the +most compact binary notation using integer values. For example, if a node +supports `hugepages-2048kB`, a resource `hugepages-2Mi` will be shown in node +capacity and allocatable values. Operators may set aside pre-allocated huge +pages that are not available for user pods similar to normal memory via the +`--system-reserved` flag. + +There are a variety of huge page sizes supported across different hardware +architectures. It is preferred to have a resource per size in order to better +support quota. For example, 1 huge page with size 2Mi is orders of magnitude +different than 1 huge page with size 1Gi. We assume gigantic pages are even +more precious resources than huge pages. + +Pre-allocated huge pages reduce the amount of allocatable memory on a node. The +node will treat pre-allocated huge pages similar to other system reservations +and reduce the amount of `memory` it reports using the following formula: + +``` +[Allocatable] = [Node Capacity] - + [Kube-Reserved] - + [System-Reserved] - + [Pre-Allocated-HugePages * HugePageSize] - + [Hard-Eviction-Threshold] +``` + +The following represents a machine with 10Gi of memory. 1Gi of memory has been +reserved as 512 pre-allocated huge pages sized 2Mi. As you can see, the +allocatable memory has been reduced to account for the amount of huge pages +reserved. + +``` +apiVersion: v1 +kind: Node +metadata: + name: node1 +... +status: + capacity: + memory: 10Gi + hugepages-2Mi: 1Gi + allocatable: + memory: 9Gi + hugepages-2Mi: 1Gi +... +``` + +## Pod Specification + +A pod must make a request to consume pre-allocated huge pages using the resource +`hugepages-` whose quantity is a positive amount of memory in +bytes. The specified amount must align with the ``; otherwise, +the pod will fail validation. For example, it would be valid to request +`hugepages-2Mi: 4Mi`, but invalid to request `hugepages-2Mi: 3Mi`. + +The request and limit for `hugepages-` must match. Similar to +memory, an application that requests `hugepages-` resource is at +minimum in the `Burstable` QoS class. + +If a pod consumes huge pages via `shmget`, it must run with a supplemental group +that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of +this group is outside the scope of this specification. + +Initially, a pod may not consume multiple huge page sizes in a single pod spec. +Attempting to use `hugepages-2Mi` and `hugepages-1Gi` in the same pod spec will +fail validation. We believe it is rare for applications to attempt to use +multiple huge page sizes. This restriction may be lifted in the future with +community presented use cases. Introducing the feature with this restriction +limits the exposure of API changes needed when consuming huge pages via volumes. + +In order to consume huge pages backed by the `hugetlbfs` filesystem inside the +specified container in the pod, it is helpful to understand the set of mount +options used with `hugetlbfs`. For more details, see "Using Huge Pages" here: +https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt + +``` +mount -t hugetlbfs \ + -o uid=,gid=,mode=,pagesize=,size=,\ + min_size=,nr_inodes= none /mnt/huge +``` + +The proposal recommends extending the existing `EmptyDirVolumeSource` to satisfy +this use case. A new `medium=HugePages` option would be supported. To write +into this volume, the pod must make a request for huge pages. The `pagesize` +argument is inferred from the `hugepages-` from the resource +request. If in the future, multiple huge page sizes are supported in a single +pod spec, we may modify the `EmptyDirVolumeSource` to provide an optional page +size. The existing `sizeLimit` option for `emptyDir` would restrict usage to +the minimum value specified between `sizeLimit` and the sum of huge page limits +of all containers in a pod. This keeps the behavior consistent with memory +backed `emptyDir` volumes whose usage is ultimately constrained by the pod +cgroup sandbox memory settings. The `min_size` option is omitted as its not +necessary. The `nr_inodes` mount option is omitted at this time in the same +manner it is omitted with `medium=Memory` when using `tmpfs`. + +The following is a sample pod that is limited to 1Gi huge pages of size 2Mi. It +can consume those pages using `shmget()` or via `mmap()` with the specified +volume. + +``` +apiVersion: v1 +kind: Pod +metadata: + name: example +spec: + containers: +... + volumeMounts: + - mountPath: /hugepages + name: hugepage + resources: + requests: + hugepages-2Mi: 1Gi + limits: + hugepages-2Mi: 1Gi + volumes: + - name: hugepage + emptyDir: + medium: HugePages +``` + +## CRI Updates + +The `LinuxContainerResources` message should be extended to support specifying +huge page limits per size. The specification for huge pages should align with +opencontainers/runtime-spec. + +see: +https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits + +The CRI changes are required before promoting this feature to Beta. + +## Cgroup Enforcement + +To use this feature, the `--cgroups-per-qos` must be enabled. In addition, the +`hugetlb` cgroup must be mounted. + +The `kubepods` cgroup is bounded by the `Allocatable` value. + +The QoS level cgroups are left unbounded across all huge page pool sizes. + +The pod level cgroup sandbox is configured as follows, where `hugepagesize` is +the system supported huge page size(s). If no request is made for huge pages of +a particular size, the limit is set to 0 for all supported types on the node. + +``` +pod/hugetlb..limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages-]) +``` + +If the container runtime supports specification of huge page limits, the +container cgroup sandbox will be configured with the specified limit. + +The `kubelet` will ensure the `hugetlb` has no usage charged to the pod level +cgroup sandbox prior to deleting the pod to ensure all resources are reclaimed. + +## Limits and Quota + +The `ResourceQuota` resource will be extended to support accounting for +`hugepages-` similar to `cpu` and `memory`. The `LimitRange` +resource will be extended to define min and max constraints for `hugepages` +similar to `cpu` and `memory`. + +## Scheduler changes + +The scheduler will need to ensure any huge page request defined in the pod spec +can be fulfilled by a candidate node. + +## cAdvisor changes + +cAdvisor will need to be modified to return the number of pre-allocated huge +pages per page size on the node. It will be used to determine capacity and +calculate allocatable values on the node. + +## Roadmap + +### Version 1.8 + +Initial alpha support for huge pages usage by pods. + +### Version 1.9 + +Resource Quota support. Limit Range support. Beta support for huge pages +(pending community feedback) + +## Known Issues + +### Huge pages as shared memory + +For the Java use case, the JVM maps the huge pages as a shared memory segment +and memlocks them to prevent the system from moving or swapping them out. + +There are several issues here: +- The user running the Java app must be a member of the gid set in the + `vm.huge_tlb_shm_group` sysctl +- sysctl `kernel.shmmax` must allow the size of the shared memory segment +- The user's memlock ulimits must allow the size of the shared memory segment +- `vm.huge_tlb_shm_group` is not namespaced. + +### NUMA + +NUMA is complicated. To support NUMA, the node must support cpu pinning, +devices, and memory locality. Extending that requirement to huge pages is not +much different. It is anticipated that the `kubelet` will provide future NUMA +locality guarantees as a feature of QoS. In particular, pods in the +`Guaranteed` QoS class are expected to have NUMA locality preferences. +