# Added README.md specific for this effort (2MB default allocation in Linux) #393

# Overview

In the early days of Intel and AMD based 32-bit personal computers and servers, machines often had less than 16MB of memory. Nowadays, high-end servers used in high-performance computing and big data applications may have multiple terabytes of memory. While the total amount of memory available has increased up to a million times, current operating systems still manage memory with the same 4KB granularity used in those early days. This makes memory management a significant source of overhead for many workloads, since all this memory needs to be mapped through very large page tables that must be maintained by the OS. Another source of overhead is that the use of 4KB page sizes typically causes more TLB misses during memory accesses.
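
For a sense of scale, here is a back-of-the-envelope sketch (assuming the usual x86-64 long-mode figures of 8-byte entries and 512 entries per table level, not a measurement of any real system) comparing the last-level page table footprint needed to map 1TB of memory with 4KB versus 2MB pages:

```c
/* pagetable_overhead.c - back-of-the-envelope page table size comparison.
 * Assumes x86-64 long mode: 8-byte entries, 512 entries per table level. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t ram        = 1ULL << 40;  /* example: 1TB of mapped memory */
    const uint64_t entry_size = 8;           /* bytes per page table entry    */

    /* 4KB pages: one PTE per 4KB page in the last (PT) level. */
    uint64_t tables_4k = ram / (4ULL << 10) * entry_size;

    /* 2MB pages: the walk stops one level earlier, one PDE per 2MB page. */
    uint64_t tables_2m = ram / (2ULL << 20) * entry_size;

    printf("mapping %llu GB:\n", (unsigned long long)(ram >> 30));
    printf("  4KB pages -> %llu MB of last-level page tables\n",
           (unsigned long long)(tables_4k >> 20));
    printf("  2MB pages -> %llu MB of last-level page tables\n",
           (unsigned long long)(tables_2m >> 20));
    return 0;
}
```

Under these assumptions, mapping 1TB once takes about 2GB of last-level page tables with 4KB pages, versus about 4MB with 2MB pages.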

This project aims to evaluate the potential performance benefits of using a larger page size supported by the x86-64 architecture (ideally 2MB) as the default allocation unit for managing memory *on systems with very large amounts of memory*. It is not a solution for improving performance on every system.

Default page sizes other than 4KB can already be used on platforms such as Alpha, ARM64 and IA-64, so bugs caused by application, file system or device driver developers assuming memory pages are always 4KB have probably already been identified and fixed. This does not mean, however, that making x86-64 able to use a different default page size is not a daunting task: problems fixed on other platforms may still be present in x86-64 specific software components.

While some gain in execution performance is almost certain (https://www.kernel.org/doc/Documentation/vm/transhuge.txt), the exact numbers are difficult to determine without testing an actual implementation with real workloads. There are also trade-offs, such as higher memory consumption for small processes due to increased memory fragmentation (especially internal fragmentation). Huge memory systems might want to use the new DAX enhancements (https://www.kernel.org/doc/Documentation/filesystems/dax.txt), which require the file system block size to be equal to the Kernel's default page size, so file systems stored in byte-addressable memory (either volatile or non-volatile) will also need to use 2MB allocation units in order to use DAX.

# Roadmap (Rough draft)

With so many unknowns, the first task is to get better estimates of the potential gains and trade-offs. The lowest hanging fruit (and suggested first step) is to estimate the impact on memory fragmentation, which can be computed from snapshots of all the memory segments present on a system running typical workloads. All the segments created using 4KB pages would have their beginning and ending addresses rounded to 2MB alignment, and the total padding inserted would show how much memory would be wasted due to increased internal fragmentation.
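
A minimal sketch of that first step is shown below, assuming the snapshots come from /proc/<pid>/maps on a running Linux system (the command-line argument and the per-mapping rounding policy are this example's choices, not decisions of the project): round every mapping out to 2MB boundaries and total the padding.

```c
/* frag_estimate.c - estimate the internal fragmentation added by a 2MB
 * allocation unit.  Reads a /proc/<pid>/maps-style file and sums the padding
 * needed to round every mapping out to 2MB boundaries.
 * Usage: ./frag_estimate /proc/1234/maps */
#include <stdio.h>
#include <stdint.h>

#define HUGE_SZ       (2ULL << 20)                 /* 2MB allocation unit */
#define ALIGN_DOWN(x) ((x) & ~(HUGE_SZ - 1))
#define ALIGN_UP(x)   (ALIGN_DOWN((x) + HUGE_SZ - 1))

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s /proc/<pid>/maps\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char line[512];
    uint64_t used = 0, padded = 0;
    while (fgets(line, sizeof(line), f)) {
        unsigned long long start, end;
        if (sscanf(line, "%llx-%llx", &start, &end) != 2)
            continue;                     /* skip lines that are not mappings */
        used   += end - start;            /* what the process maps today      */
        padded += ALIGN_UP(end) - ALIGN_DOWN(start); /* after 2MB rounding    */
    }
    fclose(f);

    printf("mapped now  : %llu KB\n", (unsigned long long)(used >> 10));
    printf("rounded 2MB : %llu KB (overhead %llu KB)\n",
           (unsigned long long)(padded >> 10),
           (unsigned long long)((padded - used) >> 10));
    return 0;
}
```

Note that rounding each mapping independently double-counts 2MB regions shared by adjacent mappings, so this gives a pessimistic upper bound; merging contiguous ranges before rounding would tighten the estimate.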

Once we become confident that the increase in memory consumption lies within acceptable limits, we can start to evaluate performance through a basic implementation of an all-huge-page Kernel. The kernel already contains most of the low-level code needed to manipulate huge pages, and making it default to 2MB allocation granularity should be reasonably straightforward. However, this could potentially have unpredictable adverse side effects in other parts of the x86-64 platform-specific code (or even in the platform-independent code).
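
As an illustration of the huge-page machinery that already exists, the userspace sketch below (assuming a system whose default hugetlb page size is 2MB and that has huge pages reserved, e.g. via /proc/sys/vm/nr_hugepages) asks the current kernel for a 2MB-backed anonymous mapping through the explicit MAP_HUGETLB path:

```c
/* huge_mmap.c - ask the existing hugetlb machinery for a 2MB-backed mapping.
 * Requires reserved huge pages, e.g.: echo 16 > /proc/sys/vm/nr_hugepages */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (4UL * 2 * 1024 * 1024)    /* four 2MB pages */

int main(void)
{
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* e.g. no huge pages reserved */
        return 1;
    }

    memset(p, 0, LEN);                 /* touch the memory so it is faulted in */
    printf("got %lu MB backed by huge pages at %p\n", LEN >> 20, p);

    munmap(p, LEN);
    return 0;
}
```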

Getting the 2MB default page implementation correct might not be enough to successfully boot Linux. Any executable that requires a memory segment to be created at a specific address which is not 2MB aligned will potentially (if not certainly) crash. If address space layout randomization (https://pax.grsecurity.net/docs/aslr.txt) is enabled in the kernel, it will probably be running only position-independent executables. However, if there are executables that require segments at fixed addresses not aligned to 2MB (and recompiling them is not possible), we may need to modify the Kernel to add padding before these segments during their creation to make them 2MB aligned (but the impact of this throughout the Kernel has yet to be investigated: demand paging? swapping? Do all these procedures need to become aware that the segment was not originally 2MB aligned?)
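
One way to get an early idea of how many binaries impose such constraints is sketched below (assuming 64-bit ELF files; the alignment check is this example's, not an existing tool): inspect the PT_LOAD program headers and flag segments whose virtual address is not a multiple of 2MB.

```c
/* elf_align.c - report whether an ELF executable's loadable segments are
 * compatible with a 2MB page size.  Usage: ./elf_align /bin/ls */
#include <elf.h>
#include <stdio.h>
#include <string.h>

#define HUGE_SZ (2UL << 20)

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <elf64-file>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf64_Ehdr eh;
    if (fread(&eh, sizeof(eh), 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64) {
        fprintf(stderr, "not a 64-bit ELF file\n");
        fclose(f);
        return 1;
    }

    for (int i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        fseek(f, (long)(eh.e_phoff + (Elf64_Off)i * eh.e_phentsize), SEEK_SET);
        if (fread(&ph, sizeof(ph), 1, f) != 1)
            break;
        if (ph.p_type != PT_LOAD)
            continue;                  /* only loadable segments matter here */
        printf("PT_LOAD at 0x%llx (align 0x%llx): %s\n",
               (unsigned long long)ph.p_vaddr,
               (unsigned long long)ph.p_align,
               ph.p_vaddr % HUGE_SZ == 0 ? "2MB aligned" : "would need padding");
    }
    fclose(f);
    return 0;
}
```

For position-independent executables the reported addresses are load offsets rather than fixed addresses, so only ET_EXEC binaries actually impose the hard constraint described above.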


# FAQ

## Doesn't transparent huge pages (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html) already provide a good enough solution?
THP is a feature that causes some debate. Transparent huge pages abstract the complexities of using larger pages away from developers and system administrators, but the implementation also incurs additional overhead due to continuous scanning for pages that could be merged into larger pages (the khugepaged kernel thread), and due to splitting huge pages back into smaller pages in certain situations. THP also makes the Kernel code more complex. Moreover, THP is not well suited for database workloads and currently only maps anonymous memory regions such as heap and stack space. THP is currently disabled by default in order to avoid the risk of increasing the memory footprint of applications without a guaranteed benefit.
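
For reference, the THP policy on a given system can be inspected through sysfs; a minimal sketch (the kernel marks the active mode with brackets, e.g. "always madvise [never]"):

```c
/* thp_status.c - print the current transparent huge page policy. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f) {
        perror("THP sysfs interface not available");
        return 1;
    }

    char buf[128];
    if (fgets(buf, sizeof(buf), f))
        printf("transparent_hugepage/enabled: %s", buf);
    fclose(f);
    return 0;
}
```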

## Wouldn't using only 2MB pages overload the Huge TLB, causing more TLB misses?
This can only be determined precisely by performing tests with typical workloads. Indeed, on modern chips such as Intel's Kaby Lake (https://en.wikichip.org/wiki/intel/microarchitectures/kaby_lake) the instruction TLB has 128 entries for 4KB pages versus only 8 entries for 2MB pages (the 4KB TLB is 16 times larger than the 2MB TLB!). For the data TLB, there are 64 entries for 4KB pages versus 32 entries for 2MB pages. While there are more entries for 4KB pages, each 2MB page replaces 512 4KB pages, so the overall number of TLB misses might still be reduced, even with the higher pressure on the 2MB TLBs. We also need to take into account that page table walks for 2MB mappings use only 3 page table levels, resulting in faster handling of each TLB miss.
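
Expressed as TLB reach, a small worked example using the Kaby Lake entry counts quoted above (other microarchitectures will differ) looks like this:

```c
/* tlb_reach.c - compare how much memory each TLB can cover ("reach"),
 * using the Kaby Lake entry counts quoted above. */
#include <stdio.h>

int main(void)
{
    unsigned long itlb_4k = 128UL * 4 * 1024;        /* 128 entries x 4KB */
    unsigned long itlb_2m = 8UL   * 2 * 1024 * 1024; /*   8 entries x 2MB */
    unsigned long dtlb_4k = 64UL  * 4 * 1024;        /*  64 entries x 4KB */
    unsigned long dtlb_2m = 32UL  * 2 * 1024 * 1024; /*  32 entries x 2MB */

    printf("iTLB reach: %lu KB (4KB pages) vs %lu MB (2MB pages)\n",
           itlb_4k >> 10, itlb_2m >> 20);
    printf("dTLB reach: %lu KB (4KB pages) vs %lu MB (2MB pages)\n",
           dtlb_4k >> 10, dtlb_2m >> 20);
    return 0;
}
```

Even with far fewer entries, the 2MB TLBs in this example cover 32 times (instruction) and 256 times (data) more memory than their 4KB counterparts.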

## Wouldn't the 4KB TLBs become an idle (wasted!) resource?
That is true. The TLBs are certainly among the most expensive components of the CPU. But if the resulting overall system performance actually increases, why would anyone care? Moreover, processors evolve every year. If using 2MB pages as the minimum memory allocation granularity proves to be worthwhile, future x86-64 processors may include flags in their control registers to reconfigure or repurpose the 4KB TLBs to handle 2MB pages (which would yield even larger jumps in performance).