Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement brk syscall to support statically linked executables #1138

Closed
wkozaczuk opened this issue May 18, 2021 · 5 comments · Fixed by #1236
Closed

Implement brk syscall to support statically linked executables #1138

wkozaczuk opened this issue May 18, 2021 · 5 comments · Fixed by #1236

Comments

@wkozaczuk
Copy link
Collaborator

The very naive implementation using mmap() might look like this:

diff --git a/linux.cc b/linux.cc
index 20da2146..f48a3107 100644
--- a/linux.cc
+++ b/linux.cc
@@ -37,6 +37,8 @@
 #include <sys/file.h>
 #include <sys/unistd.h>
 #include <sys/random.h>
 
 #include <unordered_map>
 
@@ -371,6 +373,38 @@ static int pselect6(int nfds, fd_set *readfds, fd_set *writefds,
     return pselect(nfds, readfds, writefds, exceptfds, timeout_ts, NULL);
 }
 
+#define ARBITRARY_PROGRAM_BREAK_START 0x300000000000ul
+#define ARBITRARY_PROGRAM_BREAK_SIZE    0x100000
+extern void *program_break;
+
+static u64 long_brk(void *addr)
+{
+    if (!addr && !program_break) {
+        program_break = mmap((void*)ARBITRARY_PROGRAM_BREAK_START, ARBITRARY_PROGRAM_BREAK_SIZE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+        addr = program_break;
+    }
+    return (u64)addr;
+}
+#define __NR_long_brk __NR_brk
 long syscall(long number, ...)
 {
     // Save FPU state and restore it at the end of this function
@@ -451,6 +485,10 @@ long syscall(long number, ...)
     SYSCALL2(mkdir, char*, mode_t);
 #endif
     SYSCALL3(mkdirat, int, char*, mode_t);
+    SYSCALL1(long_brk,void*);

To make it possible to de-allocate memory we would need to do something more interesting than that and truly track the addr argument that normally increases and decreases.

@nyh
Copy link
Contributor

nyh commented May 19, 2021

I don't the brk system call (I'm not familiar with "long_brk" - can you link a reference document for this system call?) should be doing what you're making it do. I think it should give the end of the allocated memory - and everything from the arbitrary (?) start until this end needs to be mapped. That's not really what your code does.

In any case, I suggest that before implementing brk, you have some sort of plan how to test it. Maybe some trivial statically-compiled C program that does all sorts of malloc() and free() and writes to the allocated area, or something. Without this sort of test, you would have no idea if your implementation works correctly or not.

@wkozaczuk wkozaczuk changed the title Implement brk syscall to support modern Golang executables and other statically linked executables Implement brk syscall to support statically linked executables May 21, 2021
@wkozaczuk
Copy link
Collaborator Author

Indeed, there is no long_brk. Please note that long_brk is simply an alias to brk (see the #define line) which I added in my exploratory patch to deal with the issue of the non-matching brk() function signature just like we do with some other syscalls.

I guess you are right about the tests. Do you know how exactly the brk syscall should work? I could not quite find any good documentation. Based on some experiments I noticed that typically the very 1st call to brk comes with the null addr to, I am guessing, discover the end (or start?) of the allocated memory. If the end, then where would be the "implicit" start in OSv case?

@nyh
Copy link
Contributor

nyh commented May 23, 2021

I guess you are right about the tests. Do you know how exactly the brk syscall should work? I could not quite find any good documentation. Based on some experiments I noticed that typically the very 1st call to brk comes with the null addr to, I am guessing, discover the end (or start?) of the allocated memory. If the end, then where would be the "implicit" start in OSv case?

Unfortunately, my only experience with the brk/sbrk system calls is from 30 years ago, when I read their manpages on a Unix machine ;-) But I understand the ideas fairly well: The "brk" is the "break address" between the memory used by the application for malloc(), and the memory it doesn't use. Or actually - the addresses above the break can be used for mmap() so modern malloc() implementations uses mmap - not the break - for large allocations. I think that you are right about the first call to brk - you don't know where is the lowest address you can use so you ask brk to give you the current break. If that result is A, you can now sbrk(1000000) and get 1 million bytes between A and A+1000000 that you can use for your allocations. If later you need another million bytes, you call sbrk(1000000) and now can use between A+1000000 and A+2000000. Note that this is "sbrk" but the system call "brk" is not incremental - you need to say brk(A+2000000) to make that the break - and allow you to write between the original break A to A+2000000.
You can decrease the break, but this is very rarely possible, because it requires the application to move around all its allocations so all the free memory is at the very end of the memory region - it's not something that a C program can often do - but you need to support it and it can happen. I think you can indeed implement the break using mmap() - just you need to mmap the entire area between the old break and the new one - you can't allocate 1MB (or any other constant) around the new break.

You can experiment by writing your own statically-linked programs that use brk/sbrk directly and run them on Linux and OSv and see how the result behaves. You can also run on Linux "strace" on any program (e.g., "strace /bin/ls") and see all the system calls it calls - including brk.

@wkozaczuk
Copy link
Collaborator Author

I think I understand how the brk syscall should work but I am not sure how we should implement it using mmap. Unlike mmap which has a starting address and length (finite area of virtual memory), the program break operates on a contiguous and infinite area of virtual memory. What I mean is, when an application asks for A + increment (where A is program break start) or asks to set it to an arbitrary value higher than A, we would need to be able to satisfy it. And if we use anonymous non-populated mmap behind the scenes to implement it, how would it exactly work?

Imagine an app calls brk for the very first time with NULL, and we mmap some initial region A : A + 2MB (we choose 2MB or a multiple of it on purpose to force usage of huge pages). This initial region is non-populated at first but gets populated as the 1st read or write of it. Later the app calls brk with A + i1 where i1 is less than 2MB in which case we simply adjust the internal program break for bookkeeping purposes. The problem arises when the app makes another call with A + i2 where i2 > 2MB. We would need to somehow extend the initial mapping (not sure if OSv mmu code would allow for this) or create another one. But what if the next 2MB region - A + 2MB: A + 4MB is already taken for other purposes because of some other thread called mmap? Maybe the solution would to be create a large enough initial region say A: A + 100G and hope it will be enough.

So given the latest memory mapping:

             vaddr            paddr     size perm memattr name
            40200000           200000   67c434 rwxp  normal kernel
        400000000000                0 40000000 rwxp  normal main
        4000000f0000            f0000    10000 rwxp  normal dmi
        4000000f5a10            f5a10      17c rwxp  normal smbios
        400040000000         40000000 3ffdd000 rwxp  normal main
        40007fe00000         7fe00000   200000 rwxp  normal acpi
        4000feb91000         feb91000     1000 rwxp  normal pci_bar
        4000feb92000         feb92000     1000 rwxp  normal pci_bar
        4000fec00000         fec00000     1000 rwxp  normal ioapic
        500000000000                0 40000000 rwxp  normal page
        500040000000         40000000 3ffdd000 rwxp  normal page
        600000000000                0 40000000 rwxp  normal mempool
        600040000000         40000000 3ffdd000 rwxp  normal mempool

and the mmap area is this:

0000000000000000 : 0000400000000000

So maybe we can pick some arbitrarily large area out of the above like 0x300000000000 : 0x400000000000 and create an anonymous unpopulated mapping for it and simply adjust the program counter within this area (hopefully we can evict/free physical memory if the counter goes down). I think we should be fine given OSv starts mapping at 0x0000200000000000 (see mmu::allocate()). But what if the app passes a hint to mmap with a start within 0x300000000000 : 0x400000000000?

Another concern is that the 0x300000000000 : 0x400000000000 is pretty huge so I am not sure how expensive it would be in terms of TLB and mapping tables even if we use huge pages (2MB). But maybe because the VMA would be un-populated and would be lazily mapped to physical memory as the application increases the break, we would not need a lot of memory for it.

Finally, instead of 0x300000000000 : 0x400000000000 we should mmap 0x300000000000 : 0x300000000000 + M where M is the size of physical memory available to OSv guest because we would never be able to give more memory than that.

@nyh
Copy link
Contributor

nyh commented Apr 16, 2023

I think maybe we can do the following in the first brk/sbrk call (you raised some of the same ideas in your own comment):

  1. Pick a maximum memory size to be used with brk, call it M. Because we don't have swap, this should be lower than the actual amount of physical memory that we have, and should be lower to leave memory for normal system operation.
  2. mmap() M bytes with prot=PROT_NONE. None of this memory will actually be allocated until used.
  3. brk/sbreak will remember the previous break and the new one, and will use mprotect() to make memory (in 2MB multiples) PROT_NONE or PROT_READ|PROT_WRITE as needed.
  4. The memory only gets allocated when actually used by the program.
  5. To unpopulate memory when brk is reduced (a rare situation, but which should be supported nontheless), one thing we can do is to munmap() the removed memory and mmap() it back again to make it lazy-populated again. Presumably if everything is done in 2MB multiples, there is no overhead in unmapping a piece of a large mmap and then mapping it back.

You're right that even 2MB pages take up space in the page table, but since we already have such a "linear map" for malloc(), I don't think it will be too much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants