x86/um: nommu: syscall translation by zpoline
This commit adds a mechanism to hook syscalls for unmodified userspace
programs running under UML in !MMU mode. The mechanism, called zpoline,
replaces syscall/sysenter instructions with `call *%rax`, which is then
handled by trampoline code installed by an initcall during boot. The
translation is triggered by elf_arch_finalize_exec(), an arch hook
introduced by another commit.

All syscalls issued by userspace are thus redirected to a specific
function, __kernel_vsyscall, introduced as the syscall entry point for
!MMU UML. This is a completely different code path from the
ptrace(2)-based syscall hook used by MMU-full UML.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
thehajime committed Dec 6, 2024
1 parent fb8648e commit 38354c7
Showing 6 changed files with 352 additions and 29 deletions.
83 changes: 68 additions & 15 deletions Documentation/virt/uml/nommu-uml.rst
@@ -30,6 +30,27 @@ called under nommu/UML environment.
works.
- return to userspace

When users enable the zpoline syscall hook (with the boot parameter
``zpoline=1``), the code path is as follows:

- boot kernel, set up zpoline trampoline code (detailed later) at address 0x0
- (userspace starts)
- calls ``vfork``/``execve`` syscalls
- during execve, more specifically in the ``load_elf_fdpic_binary()``
  function, the kernel replaces ``syscall``/``sysenter`` instructions with
  ``call *%rax``. Since ``%rax`` holds the syscall number, the call
  usually targets an address between 0 and ``NR_syscalls`` (around 512),
  where the trampoline code was installed during startup.
- when userspace issues a syscall, execution jumps to ``*%rax``, slides
  through the ``nop`` instructions until they end, and then jumps to the
  hooked function, ``__kernel_vsyscall``, which is the syscall entry point
  under the nommu UML environment.
- call handler function in ``sys_call_table[]`` and follow how UML syscall
  works.
- return to userspace
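
The trampoline layout described in the steps above can be pictured with a
small userspace sketch. This is not the code from this commit: the function
name, the ``SKETCH_NR_SYSCALLS`` constant (taken from the "around 512"
figure), and the choice of ``%r11`` as a scratch register are assumptions
made for illustration only. It fills a buffer with a ``nop`` sled followed by
an indirect jump to a hook address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sled length; the text only says "around 512". */
#define SKETCH_NR_SYSCALLS 512
/* movabs (10 bytes) + jmp *%r11 (3 bytes) */
#define SKETCH_TAIL_LEN 13

/*
 * Hypothetical helper: write a nop sled and a jump to `hook` into buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 * A `call *%rax` with rax = N lands on byte N of the sled and slides
 * down to the jump at the end.
 */
static size_t sketch_build_trampoline(uint8_t *buf, size_t len, uint64_t hook)
{
	uint8_t *p;
	int i;

	assert(buf != NULL);
	if (len < SKETCH_NR_SYSCALLS + SKETCH_TAIL_LEN)
		return 0;

	memset(buf, 0x90, SKETCH_NR_SYSCALLS);	/* 0x90 = nop */

	p = buf + SKETCH_NR_SYSCALLS;
	p[0] = 0x49;				/* movabs $imm64, %r11 */
	p[1] = 0xbb;
	for (i = 0; i < 8; i++)			/* little-endian immediate */
		p[2 + i] = (uint8_t)(hook >> (8 * i));
	p[10] = 0x41;				/* jmp *%r11 */
	p[11] = 0xff;
	p[12] = 0xe3;

	return SKETCH_NR_SYSCALLS + SKETCH_TAIL_LEN;
}
```

``%r11`` is used here because the ``syscall`` instruction itself clobbers it,
so callers cannot rely on its value; the real trampoline may save and restore
state differently.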

With the zpoline syscall hook, syscall latency is greatly improved, at
the cost of a somewhat longer process startup time. See the Benchmark
section for details.
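
As a rough illustration of the instruction replacement, the sketch below
rewrites the two-byte ``syscall`` (``0f 05``) and ``sysenter`` (``0f 34``)
opcodes into ``call *%rax`` (``ff d0``), which has the same length. The
function name is hypothetical, and the plain byte scan is a simplification:
it can misfire on matching bytes inside other instructions' operands, which
is why this commit builds the kernel's x86 instruction decoder
(``insn.o``/``inat.o``) instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: naively replace syscall/sysenter opcodes in a
 * code buffer with `call *%rax`. Returns the number of replacements.
 * A real implementation must decode instruction boundaries first.
 */
static size_t sketch_rewrite_syscalls(uint8_t *code, size_t len)
{
	size_t i, replaced = 0;

	assert(code != NULL);
	for (i = 0; i + 1 < len; i++) {
		if (code[i] != 0x0f)
			continue;
		if (code[i + 1] == 0x05 || code[i + 1] == 0x34) {
			code[i] = 0xff;		/* call *%rax */
			code[i + 1] = 0xd0;
			replaced++;
			i++;			/* skip the byte just rewritten */
		}
	}
	return replaced;
}
```

Because both encodings are two bytes long, the replacement can be done in
place without moving any surrounding code.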

What are the differences from MMU-full UML ?
============================================
@@ -42,7 +63,9 @@ MMU-full UML doesn't have:
- generic implementation of memcpy/strcpy/futex is also used
- alternate syscall entrypoint without ptrace
- alternate syscall hook
- hook syscalls with a seccomp filter (when zpoline isn't used)
- translate ``syscall``/``sysenter`` instructions into calls to
  trampoline code that hooks syscalls (when zpoline is used)

These modifications allow us to use unmodified userspace binaries with
nommu UML.
@@ -128,23 +151,27 @@ lmbench and (self-crafted) getpid benchmark (with v6.12-rc2 uml/next
tree).

.. csv-table:: lmbench (usec)
   :header: ,native,um,um-nommu(s),um-nommu(z)

   select-10 ,0.5544,29.7143,2.8920,0.2834
   select-100 ,2.3992,27.7262,3.7794,1.1732
   select-1000 ,20.4708,42.0885,12.6920,10.0434
   syscall ,0.1734,26.2471,2.6070,0.0999
   read ,0.3433,29.8828,2.6923,0.1327
   write ,0.2866,25.9753,2.6925,0.1325
   stat ,1.9195,40.1164,3.1813,0.4642
   open/close ,3.8657,63.4730,6.2049,0.7283
   fork+sh ,1161.1111,5216.5000,462.3077,18744.0000
   fork+execve ,536.5263,2117.0000,131.0633,4840.6667

.. csv-table:: do_getpid bench (nsec)
   :header: ,native,um,um-nommu(s),um-nommu(z)

   getpid, 172 , 26807 , 2614, 104


(um-nommu(z) is nommu UML with the zpoline syscall hook; um-nommu(s) is
nommu UML with the seccomp syscall hook.)
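
For readers who want to reproduce a measurement of this kind, a getpid
micro-benchmark could look roughly like the sketch below. This is not the
self-crafted benchmark used for the table above; the function name and the
iteration count are assumptions. It times a loop of raw ``getpid`` syscalls
and reports the average cost per call in nanoseconds.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: average wall-clock cost of one getpid syscall,
 * in nanoseconds, measured over `iters` calls.
 */
static double sketch_getpid_nsec(long iters)
{
	struct timespec t0, t1;
	double elapsed;
	long i;

	assert(iters > 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		syscall(SYS_getpid);	/* issue the syscall directly, no caching */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	elapsed = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
		  (double)(t1.tv_nsec - t0.tv_nsec);
	return elapsed / (double)iters;
}
```

Calling ``sketch_getpid_nsec(1000000)`` from ``main()`` and printing the
result gives a number comparable in kind, though not necessarily in
methodology, to the ``do_getpid`` row.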

Limitations
===========
@@ -164,14 +191,40 @@ implementation inherits the characteristics of other nommu kernels
Thus, the choice of userspace programs is limited. We have tested
Alpine Linux with musl-libc, which supports nommu kernels.

access to mmap_min_addr (if zpoline enabled)
--------------------------------------------
Because the syscall translation mechanism relies on being able to read
and write memory at address zero (0x0), the host kernel must be
configured to allow this with the following command::

  % sh -c "echo 0 > /proc/sys/vm/mmap_min_addr"
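
To check whether the host currently permits such a mapping, a small probe
along the following lines may help. The function name is hypothetical and
the probe is only a sketch: it attempts a fixed anonymous mapping at address
zero and reports whether it succeeded. With the default ``mmap_min_addr``
setting the ``mmap`` call is expected to fail for unprivileged processes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical probe: return 1 if `len` bytes can be mapped at address
 * zero, 0 otherwise. On failure, *err_out receives errno.
 */
static int sketch_can_map_zero_page(size_t len, int *err_out)
{
	void *p;

	assert(len > 0 && err_out != NULL);
	*err_out = 0;

	p = mmap((void *)0, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	/* A successful mapping at 0 is a null pointer, so compare
	 * against MAP_FAILED rather than NULL. */
	if (p == MAP_FAILED) {
		*err_out = errno;
		return 0;
	}
	munmap(p, len);
	return 1;
}
```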

supported architecture
----------------------
The current implementation of nommu UML only works with the x86_64
SUBARCH. We have not tested it in a 32-bit environment.

target of syscall translation (if zpoline enabled)
--------------------------------------------------
For the moment, the syscall translation only applies to the executable
and the interpreter of ELF binaries processed by the execve(2) syscall.
Other libraries, such as linked libraries and dlopen-ed ones, are not
translated; we may be able to trigger their translation via LD_PRELOAD.
Code generated by a JIT compiler is produced after execve and is
therefore not currently translated either.

Note that with musl-libc in Alpine Linux, which we have tested, most
syscalls are implemented in the interpreter file (ld-musl-x86_64.so),
so syscall/sysenter instructions issued from linked or loaded libraries
should be rare. They are nevertheless possible, in which case a
workaround with LD_PRELOAD is effective.


Further readings about NOMMU UML
================================

- NOMMU UML (original code by Ricardo Koller)
- https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf

- zpoline: syscall translation mechanism
- https://www.usenix.org/conference/atc23/presentation/yasukata
3 changes: 3 additions & 0 deletions arch/um/include/shared/os.h
@@ -351,6 +351,9 @@ static inline int os_setup_seccomp(void)
}
#else
extern int os_setup_seccomp(void);

/* zpoline.c */
extern int um_zpoline_enabled;
#endif

#endif
3 changes: 3 additions & 0 deletions arch/x86/um/asm/elf.h
@@ -188,6 +188,9 @@ do { \
struct linux_binprm;
extern int arch_setup_additional_pages(struct linux_binprm *bprm,
int uses_interp);
struct elf_fdpic_params;
extern int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
struct elf_fdpic_params *interp_params);

extern unsigned long um_vdso_addr;
#define AT_SYSINFO_EHDR 33
14 changes: 14 additions & 0 deletions arch/x86/um/nommu/Makefile
@@ -6,3 +6,17 @@ else
endif

obj-y = do_syscall_$(BITS).o entry_$(BITS).o process.o signal.o syscalls_$(BITS).o os-Linux/
obj-y += zpoline.o

# used by zpoline.c to translate syscall/sysenter instructions
# note: only in x86_64 w/ !CONFIG_MMU
inat_tables_script = $(srctree)/arch/x86/tools/gen-insn-attr-x86.awk
inat_tables_maps = $(srctree)/arch/x86/lib/x86-opcode-map.txt
quiet_cmd_inat_tables = GEN $@
cmd_inat_tables = $(AWK) -f $(inat_tables_script) $(inat_tables_maps) > $@
$(obj)/inat-tables.c: $(inat_tables_script) $(inat_tables_maps)
	$(call cmd,inat_tables)

targets += inat-tables.c
$(obj)/../../lib/inat.o: $(obj)/inat-tables.c
obj-y += ../../lib/insn.o ../../lib/inat.o