LKL does not support clone #155

Closed · davidchisnall opened this issue May 4, 2020 · 19 comments

Labels: area: sgx-lkl (Core SGX-LKL functionality), enhancement, p1 (Medium priority)

davidchisnall (Contributor) commented May 4, 2020

To fix the layering, we need to return to musl creating threads via the clone system call. Currently, LKL does not implement clone at all.

We need to provide an implementation that handles the flags required for pthread_create. The correct change in LKL may simply be to provide a host_ops hook that handles clone entirely in the LKL consumer. The musl implementation of pthread_create depends on the following clone flags (a validation sketch follows the list):

  • CLONE_VM: Share the address space with the parent. In a single-address-space world, we cannot support anything other than this.
  • CLONE_FS: Share a filesystem namespace with the parent. In a single-process world, this is the obvious thing to do.
  • CLONE_FILES: Share a file descriptor table with the parent. We probably want to support not having this so that our init process can have a separate FD table.
  • CLONE_SIGHAND: Share signal handlers. We probably want to support not having this so that our init process can have separate signal handlers.
  • CLONE_THREAD: The new thread is placed in the parent's thread group, so it shares the parent's PID (thread-group ID) and gets its own TID. It would be nice to support both variations of this so that we can have a distinct PID for init.
  • CLONE_SYSVSEM: Share the System V semaphore adjustment (semadj) values with the parent. It doesn't matter too much whether we support this, because our init process shouldn't use SysV IPC.
  • CLONE_SETTLS: Set the TLS pointer. Should simply set the %fs base value.
  • CLONE_PARENT_SETTID: Store the child's thread ID at the supplied address in the parent. Should be easy to support.
  • CLONE_CHILD_CLEARTID: Clear the child's thread ID at the supplied address on exit and wake a futex there. See "Intercept clone to handle CLONE_CHILD_CLEARTID" (#154).
  • CLONE_DETACHED: Has no effect in modern Linux; safe to ignore.
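As a minimal sketch of the validation this implies (a hypothetical helper, not SGX-LKL code), assuming we reject anything without CLONE_VM and anything outside the set above:

#include <errno.h>
#include <linux/sched.h>   /* CLONE_* flag definitions */

/* Hypothetical helper: accept only the clone flag combinations that a
 * single-address-space LKL port could plausibly honour. */
static long validate_clone_flags(unsigned long flags)
{
    /* In a single-address-space world, the address space must be shared. */
    if (!(flags & CLONE_VM))
        return -EINVAL;

    /* The flags musl's pthread_create passes (see the list above).
     * CLONE_DETACHED has no effect on modern kernels, so it is tolerated. */
    const unsigned long supported =
        CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD |
        CLONE_SYSVSEM | CLONE_SETTLS | CLONE_PARENT_SETTID |
        CLONE_CHILD_CLEARTID | CLONE_DETACHED;

    if (flags & ~supported)
        return -EINVAL;

    return 0;
}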
prp (Member) commented May 5, 2020

prp added the p1 (Medium priority) label and added this issue to the Milestone 1 milestone on May 5, 2020.

davidchisnall (Contributor, Author) commented:

Thanks. It looks as if setting __ARCH_WANT_SYS_CLONE gives us a clone implementation; we can then intercept it and validate the arguments, returning EINVAL if CLONE_VM is not set.

davidchisnall (Contributor, Author) commented May 5, 2020

With that in mind, I believe that we should:

  • Tweak LKL to expose clone.
  • Extend lthreads to have a futex value for lthread exit.
  • Add a clone syscall wrapper that:
    • Checks whether CLONE_VM is set and returns failure if not.
    • Checks CLONE_CHILD_CLEARTID; if it is set, sets the corresponding flag in the spawned lthread (probably clearing the flag).

This should also make it possible for lthreads to be aware of their Linux tid.
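As a rough illustration of the CLONE_CHILD_CLEARTID part, the obligation the lthread exit path would pick up looks like the sketch below (struct and function names are hypothetical, not the lthread API): on exit, the registered word is zeroed and one futex waiter, such as a joiner, is woken.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical per-lthread exit state: the address registered via
 * CLONE_CHILD_CLEARTID (or set_tid_address). */
struct lthread_exit_state {
    int *clear_child_tid;
};

/* Run on the exit path of a cloned thread: zero the registered word and wake
 * one futex waiter so a thread blocked in pthread_join can proceed. */
static void lthread_exit_notify(struct lthread_exit_state *st)
{
    if (st->clear_child_tid) {
        *st->clear_child_tid = 0;
        syscall(SYS_futex, st->clear_child_tid, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
}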

davidchisnall (Contributor, Author) commented:

I tried enabling the clone system call. Unfortunately, it then crashes in __alloc_pages_nodemask, I believe because the memory that we're passing to clone is memory that LKL doesn't think that it owns. I think that means that we need to fix #187 before we can do this.

prp (Member) commented May 6, 2020

> I tried enabling the clone system call. Unfortunately, it then crashes in __alloc_pages_nodemask, I believe because the memory that we're passing to clone is memory that LKL doesn't think that it owns. I think that means that we need to fix #187 before we can do this.

@davidchisnall where exactly does it crash? Is it due to an access check?

davidchisnall (Contributor, Author) commented:

Oh, never mind, my clone userspace wrapper was misaligning the stack.

davidchisnall (Contributor, Author) commented:

Okay, it looks as if it ends up in LKL's copy_thread function. This creates a new lthread (yay!) but then jumps to the new thread's stack pointer, rather than returning.

That works correctly for PF_KTHREAD threads, but it breaks for userspace threads. We can probably handle userspace threads by doing a __switch_to in the new thread, though we have to be careful that the cloned threads aren't accidentally aliasing kernel threads.

prp (Member) commented May 7, 2020

@davidchisnall, the problem that you will now face is that the kernel scheduler will want to run these (now kernel-visible) user-level threads. This is inconsistent with the lthread scheduler being in control of their scheduling. You could modify the kernel threads that represent user-level threads to return immediately, but then you will still have the overhead of LKL doing many spurious context-switches.

davidchisnall (Contributor, Author) commented:

How does this work for kernel threads? These are already created via the same code path and we have a lot of them.

prp (Member) commented May 7, 2020

When a userspace lthread does a system call, it assumes the identity of a unique host task (kernel thread) inside the kernel. After the system call has been executed, we have the LKL scheduler context-switch to all pending kernel tasks before returning to userspace.

IIRC the host tasks that represent the userspace lthreads are never selected by the kernel scheduler for execution. Perhaps it will be enough if you simply create the lthread/host task mapping at clone time (and not when the first system call is invoked).

davidchisnall (Contributor, Author) commented:

I think I am still a bit confused. When a new kernel thread is created, it goes into the copy_thread function in the LKL arch, which then spawns a new lthread. These are switched to with the kernel's __switch_to routine, but are they also run concurrently with the lthread scheduler? If not, how does the lthread scheduler know not to run these threads?

davidchisnall (Contributor, Author) commented:

Also, which direction of mapping are you talking about? The Linux task structure contains the lthread ID of the thread, which is set when the lthread is created in the LKL arch for any lthread created via the clone_thread call (currently only kernel threads; if we support clone, then also userspace threads). Is there also a mapping from lthread to Linux TID, or is the mapping unidirectional?

prp (Member) commented May 7, 2020

The kernel scheduler's __switch_to routine uses a host_ops semaphore associated with each lthread to signal to the lthread scheduler what can run. Only a single kernel-visible lthread will be unblocked at a time because LKL needs complete control over concurrency.

LKL retrieves the task_struct from the lthread's TLS.
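A minimal user-space sketch of that hand-off pattern (illustrative structure only, not LKL's actual __switch_to): each kernel-visible thread blocks on its own semaphore, and a switch posts the next thread's semaphore before waiting on its own, so only one such thread runs inside the kernel at a time.

#include <semaphore.h>

/* One of these per kernel-visible lthread/host task (illustrative only). */
struct host_thread {
    sem_t run;   /* posted when this thread is allowed to run inside LKL */
};

/* Hand execution from `prev` to `next`: wake the next thread, then block
 * until something switches back to us. */
static void switch_to_sketch(struct host_thread *prev, struct host_thread *next)
{
    sem_post(&next->run);
    sem_wait(&prev->run);
}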

davidchisnall (Contributor, Author) commented May 7, 2020

Thanks, that makes sense. To check I understand how this all fits together:

LKL has a notion of a 'host task', which is a task that is externally scheduled, but still has LKL state associated with it.

When an lthread that was not created by LKL calls lkl_syscall, it allocates a new host task. The kernel scheduler is aware of this thread, but it's not ever passed back to the scheduler, so it's never run?

Because LKL has a single process model, all of these threads are assumed to be threads of the init process (their task is looked up by pid [thread ID] in the init pid namespace [process]).

When a thread returns from a syscall, the thread_sched_jb function unlocks

I think we should be able to create cloned threads in almost the same way. Does this sound like a sensible plan?

  • Add another hook to the host ops structure that creates a new host thread given a PC and stack pointer.
  • Modify copy_thread to use this for new threads that don't have PF_KTHREAD set.
  • Explicitly set the task structure for the lthread on lthread creation.
  • Mark the newly created threads as TIF_HOST_THREAD so that LKL regards them as threads that it is not responsible for scheduling and that are allowed to make system calls.

As far as I can tell, lkl_syscall doesn't currently store the return address anywhere, so we'd need to add that into the thread state (__builtin_return_address() would be sufficient; we wouldn't need anything else, since a new thread only needs %rip and %rsp set and can then continue). In lthreads, we'd construct the new lthread and initialise the esp and eip values in its cpu_ctx so that the next _switch to it will jump to the new thread. I believe lthread_exit should still work correctly. The userspace code that wraps clone does an exit system call if the function that is passed to it returns, which will trigger LKL to exit the thread (calling lthread_exit), so we won't pop off the top of the stack.
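To make the proposed host-ops hook concrete, it might look roughly like the sketch below; the name and signature are hypothetical illustrations, not the actual lkl_host_operations interface.

/* Hypothetical host-ops addition: create a host lthread that starts executing
 * at a given program counter, on a given stack, with a given TLS pointer, and
 * return an opaque handle so copy_thread can associate it with the new
 * task_struct immediately rather than lazily at the first system call. */
struct proposed_clone_host_ops {
    void *(*thread_create_with_ctx)(void (*pc)(void), void *sp, void *tls);
};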

Does that make sense, or have I missed anything important?

prp (Member) commented May 7, 2020

Yes, that makes sense to me.

One minor thing is that our threads don't share the parent pid of init, but rather have a different host parent task; otherwise the kernel wouldn't deliver certain signals to pid=1.

davidchisnall (Contributor, Author) commented May 13, 2020

I have started working on this in the wip-clone branch. Current status:

  • Clone system call exists.
  • Calling it creates a new lthread.
  • The new lthread begins executing on the provided stack, with the PC set to the return address of the syscall caller.
  • The Linux task_struct created for the clone call is associated with the new thread.
  • Normal syscall return happens in the original lthread.
  • The first time either lthread does a syscall, both threads deadlock.

It seems to be nearly there, but I have not yet been able to diagnose the cause of the deadlock. In the test case, the timer thread is firing and delivering ticks, but nothing else happens after the first attempt at a syscall. With tracing enabled, we see this with 8 ethreads:

[[    LKL   ]] lkl_syscall(): calling run_syscall() (no=220 task=host2 current=host2)
[[    LKL   ]] alloc_thread_stack_node(): enter (task= node=-1)
[[    LKL   ]] init_ti(): enter
[[    LKL   ]] setup_thread_stack(): enter
[[    LKL   ]] copy_thread(): enter
[[    LKL   ]] lkl_syscall(): returned from run_syscall() (no=220 task=host2 current=host2)
[[    LKL   ]] lkl_syscall(): enter (no=64 current=host2 host0->TIF host0->TIF_SIGPENDING=1)
[[    LKL   ]] do_signal(): enter
[[    LKL   ]] __switch_to(): host2=>ksoftirqd/0
[[    LKL   ]] __switch_to(): ksoftirqd/0=>host2
[[    LKL   ]] lkl_syscall(): done (no=220 task=host2 current=ksoftirqd/0 ret=43)
[[    LKL   ]] lkl_syscall(): enter (no=66 current=host2 host0->TIF host0->TIF_SIGPENDING=1)

With 1 ethread, everything is serialised and we see this:

[[    LKL   ]] lkl_syscall(): enter (no=220 current=host2 host0->TIF host0->TIF_SIGPENDING=1)
[[    LKL   ]] lkl_syscall(): CPU lock acquired
[[    LKL   ]] lkl_syscall(): switching to host task (no=220 task=host2 current=host2)
[[    LKL   ]] switch_to_host_task(): enter (task=host2 current=host2 task->TIF_HOST_THREAD=1 task->TIF_SIGPENDING=0)
[[    LKL   ]] lkl_syscall(): calling run_syscall() (no=220 task=host2 current=host2)
[[    LKL   ]] alloc_thread_stack_node(): enter (task= node=-1)
[[    LKL   ]] init_ti(): enter
[[    LKL   ]] setup_thread_stack(): enter
[[    LKL   ]] copy_thread(): enter
[[    LKL   ]] lkl_syscall(): returned from run_syscall() (no=220 task=host2 current=host2)
[[    LKL   ]] do_signal(): enter
[[    LKL   ]] __switch_to(): host2=>ksoftirqd/0
[[    LKL   ]] lkl_syscall(): done (no=220 task=host2 current=ksoftirqd/0 ret=43)
[[    LKL   ]] lkl_syscall(): enter (no=66 current=ksoftirqd/0 host0->TIF host0->TIF_SIGPENDING=1)
[[    LKL   ]] lkl_syscall(): enter (no=64 current=ksoftirqd/0 host0->TIF host0->TIF_SIGPENDING=1)
[[    LKL   ]] __switch_to(): ksoftirqd/0=>host2

It appears as if one lthread enters lkl_syscall and yields, then the other lthread enters lkl_syscall and also yields. Neither lthread is ever rescheduled. Both are likely blocking on the same futex, but it's not yet clear which one or why.

davidchisnall (Contributor, Author) commented:

@prp, do you know how LKL's current macro works? I haven't yet been able to chase all of the macros. I suspect that, when we copy the TLS from the parent thread, we may accidentally be copying the currently running task, so LKL gets confused on syscall entry.

davidchisnall (Contributor, Author) commented:

Looking at the preprocessed source for syscalls.c, it appears that current is current_thread_info()->task and current_thread_info() just returns _current_thread_info, so this won't be affected by anything here.
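Stripped to its essentials, the expansion described above looks roughly like this (stand-in declarations for illustration; the real definitions live in the LKL arch headers):

/* Simplified view of the macro chain seen in the preprocessed source:
 * `current` is the task hanging off a single global thread_info pointer,
 * so it is not derived from the (copied) userspace TLS. */
struct task_struct;

struct thread_info {
    struct task_struct *task;
};

static struct thread_info *_current_thread_info;

#define current_thread_info() (_current_thread_info)
#define current (current_thread_info()->task)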

davidchisnall added a commit that referenced this issue May 15, 2020
davidchisnall added a commit that referenced this issue May 20, 2020
This implements two new LKL hooks. The first creates an lthread with a
specific initial register state (to capture the returns-twice behaviour
of clone, along with the caller's ability to define the stack and TLS
addresses).  The new thread is immediately associated with the Linux
task structure (normally, lthreads are associated with Linux tasks
lazily, when they perform their first system call).

The second hook destroys a thread.  This is done in response to an exit
system call.  This is somewhat complicated, because LKL never returns to
this thread and the thread's stack may be deallocated by the time we
exit it.

There is no easy way to add a mechanism to the lthread scheduler for
killing a thread without that thread running.  We can add one
eventually, but for now we create a temporary stack that lthreads can
use during teardown and make them run the teardown from there.

Disable the access02 test: it was spuriously passing and this change
makes it fail.
See #277 for more information.

Fixes #155
davidchisnall (Contributor, Author) commented:

This was fixed in #259.
