PVF worker: consider fork instead of thread for better isolation #574

mrcnski · 2023-08-23T15:17:41Z

Right now each PVF job runs in its own thread, but doing a fork may be better:

Read needed information from socket into a buffer.
Fork, child immediately closes socket - but has access to that buffer.
Child runs and validates, information it needs to report back can be sent over a pipe.
Parent waits for child to terminate, communicates back the result and clears out the buffer.

In the end the thread spawning would just be replaced by a fork() call. Reason this might be good: It is just better isolation: The process can now only mess with its own address space and can not possible affect execution of upcoming candidates for example.

Credit: @eskimor

The text was updated successfully, but these errors were encountered:

eskimor · 2023-08-23T15:33:24Z

If the memory usage in the parent is low, I believe fork should end up being equally cheap then threads. The only difference should be COW semantics on memory pages, but if there was very little (mutable) memory used by the parent then this should not really matter.

mrcnski · 2023-08-28T12:51:31Z

@jpserrat If you are still keen on contributing, this would be a great way to make a big impact. It shouldn't be an overly demanding task, while still being interesting.

jpserrat · 2023-08-29T08:22:16Z

@mrcnski I'm going to start working on this now.
Thanks a lot for reaching out.

mrcnski · 2023-08-29T15:12:14Z

Here there is a section called "Why not fork?". But I didn't really understand the concerns and am ending for the day.

mrcnski · 2023-08-31T10:00:11Z

My concern about the thread stuff pep mentioned was, what if we go back to multi-threaded execution at some point. A while back we found that multi-threaded had significantly better performance (at the cost of determinism):
paritytech/polkadot#7206 (comment)
But I don't know if what pep said about threads makes sense.

We also have the memory metrics and cpu time metrics running in separate threads. Those would now have to be on the forked process. But, I think with a separate process we can just have the parent call rusage on the child at the end. We should research it, would be a nice simplification to remove the metrics threads.

jpserrat · 2023-09-01T13:01:37Z

Great points. Regarding your concern, it has a high chance of happening, so I will make this implementation with this in mind, ensuring it's easy to switch if necessary. What do you think?
About the stuff mentioned on the Why not fork?, I'm not sure about it either but I'm going to do some tests.

Just to make sure that I got it all right. This change is going to be on the worker executor entrypoint right? When we spawn a new thread execute thread

mrcnski · 2023-09-04T12:25:05Z

Here's the problem I see with the threads:

We have a thread that runs simultaneously with the prepare/execute threads, that measures CPU time used.
We have another thread that runs along with the prepare thread, that measures memory usage.
Ideally we can stop using threads because they can be hijacked by a malicious job, and made to report wrong values.
1. e.g. a process can take a huge amount of memory but made to report a value that is under the limit, to make it pass e.g. pre-checking.
We can get both of these values with getrusage(2) instead, but there are two issues with it:
1. We can't rely on the forked process to report its own values with RUSAGE_SELF (point (3) above).
2. We can't use RUSAGE_CHILDREN from the parent of the forked process, because it gets values for all children that have terminated, not just the one we're interested in.
  1. It would work for CPU time because that's cumulative and we only would have one process at once.
Conclusion: Due to point (3), ideally we could find a way to get these values some other way, from the parent of the forked process.

Just to make sure that I got it all right. This change is going to be on the worker executor entrypoint right? When we spawn a new thread execute thread

Yep, and also in the prepare worker entrypoint. :) But we can focus on execute right now to keep things simple.

mrcnski · 2023-09-05T17:06:27Z

Sorry for things being up in the air while we figure this out. 😅 I am currently wondering if the threads can be removed:

preparation memory - this is not super accurate and we have plans to get the stat some other way.
CPU time - this can be falsified by an attacker. But, I believe we can just replace this thread with getrusage in the parent of the forked process.

(I appreciate your help with threads in the past. I just hadn't realized that they are not isolated and therefore not secure.)

Unfortunately @s0me0ne-unkn0wn is on vacation and I'm waiting for his input. :)

jpserrat · 2023-09-07T09:58:21Z

No problem, I'm looking at the code to get a better understanding during this time. :)
I just wanted to clarify the CPU time part. From what I understood, currently, we are measuring CPU time per job. With the new approach, we would measure the cumulative CPU time of all children's processes. Is my understanding right? Why are we making this change?

mrcnski · 2023-09-07T10:26:21Z

With the new approach, we would measure the cumulative CPU time of all children's processes.

Good question! Yeah, getrusage gives a cumulative CPU time. But since we run one child process at a time, we can get the delta between the cumulative CPU time after the current job, and the cumulative CPU time after the last job. That should give us the CPU time just for the current job. Sorry for being unclear.

jpserrat · 2023-09-09T11:43:26Z

Now I get it, thanks!
I have another one. The CPU time that the job took is currently measured at the and of prepare_artifact/validate_using_artifact, for this one getrusage can be used to get the duration that the job took to end. However, for the cpu_time_monitor_loop, we would still need the thread, as getrusage can't be used. It is not clear to me how we would be able to remove the separated thread for cpu_time_monitor_loop for the timeout check.

mrcnski · 2023-09-10T09:27:36Z

Ah, you mean how do we stop the child process when it times out? Good point, we still need a thread on the parent side which works in much the same way as cpu_time_monitor_loop. Only difference is that it would call getrusage with the RUSAGE_CHILDREN argument. Does that answer your question?

s0me0ne-unkn0wn · 2023-09-10T09:54:55Z

Sorry for being late to the party. I've read through the discussion but haven't got a strong opinion yet. Here are some thoughts.

First of all, threads are not evil by themselves. We were concerned about inter-job isolation, and that's why job processes are better than job threads, but threads running inside a job (and cleaned up with a job) are not something bad by definition. We could still use them if need be.

Further, you consider the situation when the malicious code breaks out and falsifies the CPU/memory measurement done by some thread running inside a job. That's a valid point for execution (I still believe it's not so valid for preparation, and anyway, it's solved by the limiting global allocator that doesn't require threads). Yes, that may happen. But is that different from a situation when the code breaks out and falsifies validation results themselves, resulting in the inclusion of an invalid parachain block? IIUC, the latter carries a much higher risk than just reporting invalid CPU/RAM measurements, and it has nothing to do with threads at all.

I mean, we shouldn't get rid of threads just to get rid of threads. Observing the resource consumption from the outside is a good idea (we still need to figure out if those measurements would be deterministic enough). We just don't have to push with eliminating threads without a clear need.

mrcnski · 2023-09-10T10:43:15Z

Thanks @s0me0ne-unkn0wn, good points!

About the measurements from the child, keeping the attack surface as low as possible is a good idea. If attackers falsify the validation result, that might be detectable if the attack fails on other machines and a dispute is raised - but falsified measurements would not be detectable. So I still say the child should do only the job it's spawned to do, just to avoid providing any additional attack surface however small it may be.

About threads, you may be right, I was just concerned about pep's points linked above in "Why not fork". Thinking about it more, it may be preferable that no threads are present while fork is being called. pep's point might have been that managing processes in a multi-threaded context can get hairy as there are lots of subtle race conditions involved. (The pidfd API was developed partially to address this, and I would suggest using it if possible.) But I think if we have the CPU time monitor thread, it should start after fork anyway, so it doesn't measure the child start-up time.

Observing the resource consumption from the outside is a good idea (we still need to figure out if those measurements would be deterministic enough).

For CPU time it shouldn't affect determinism, since we are already relying on the kernel to get this stat - doing it from the parent instead should change nothing. For memory, what we have now is a polling-based method which is already not very deterministic at all. Using strace or or your global allocator approach would both be better in this regard.

s0me0ne-unkn0wn · 2023-09-10T11:45:00Z

it may be preferable that no threads are present while fork is being called

I don't think it's a big concern:

The child process is created with a single thread -- the one that called fork(). The entire virtual address space of the parent is replicated in the child, including the states of mutexes, condition variables, and other pthreads objects; the use of pthread_atfork(3) may be helpful for dealing with problems that this can cause.

jpserrat · 2023-09-12T08:40:53Z

It does answer my question, thanks. Just one more point about this.

Only difference is that it would call getrusage with the RUSAGE_CHILDREN argument.

getrusage with RUSAGE_CHILDREN will return the stats of all terminated children, as we need to get the CPU time from the running child, this is not going to work for this purpose. As RUSAGE_SELF can not be used as well, should we go back to the ProcessTime approach?

About the concern with thread and fork, as mentioned by @mrcnski the CPU and memory threads will start after the fork, do we need to worry about this? If there are other points of concern with this we can do as @s0me0ne-unkn0wn suggested with pthread_atfork.

mrcnski · 2023-09-12T15:31:37Z

getrusage with RUSAGE_CHILDREN will return the stats of all terminated children, as we need to get the CPU time from the running child, this is not going to work for this purpose.

Oh, right... I hate Linux. 😂 We could check /proc/[pid]/stat, but I'm not sure we can assume /proc/ is mounted. At least on my GCP instance it wasn't.

As RUSAGE_SELF can not be used as well, should we go back to the ProcessTime approach?

I don't think ProcessTime API can be used to get the cpu time of a child. But maybe clock_gettime can be used?

But wait... now that we are forking, we can have the forked process set its own CPU time limit right before it runs any untrusted code. I believe it can be done with setrlimit(2).

Great point @jpserrat ⭐️

About the concern with thread and fork, as mentioned by @mrcnski the CPU and memory threads will start after the fork

With the above solution we won't need the CPU time thread. So execute jobs won't need any threads.

For prepare jobs, the memory thread can be put on the side of the forked child. We decided that an attacker manipulating this value is not a concern in our threat model. Eventually it might be replaced with a global allocator tracker, but that remains an ongoing area of research. So for now, let's have a memory thread on the forked child and a thread that performs the prepare job, but no CPU time thread.

s0me0ne-unkn0wn · 2023-09-12T17:21:39Z

We could check /proc/[pid]/stat, but I'm not sure we can assume /proc/ is mounted.

In general case, we cannot assume that, but maybe we could just mimic its behavior? The parent process must have all the needed permissions iiuc. It's somewhere over there: https://elixir.bootlin.com/linux/v4.14/source/fs/proc/array.c#L390

now that we are forking, we can have the forked process set its own CPU time limit right before it runs any untrusted code. I believe it can be done with setrlimit(2).

That's true, but following your worst-case logic, the untrusted code can break out and reset that limit with setrlimit() as well 🙂

mrcnski · 2023-09-12T17:49:15Z

In general case, we cannot assume that, but maybe we could just mimic its behavior? The parent process must have all the needed permissions iiuc. It's somewhere over there: elixir.bootlin.com/linux/v4.14/source/fs/proc/array.c#L390

That seems really hacky. 😂

That's true, but following your worst-case logic, the untrusted code can break out and reset that limit with setrlimit() as well 🙂

Not so! Assuming the untrusted code can't escalate its privilege, which is much harder than just running arbitrary code:

The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit.

s0me0ne-unkn0wn · 2023-09-12T18:25:51Z

That seems really hacky. 😂

Okay, I have an idea of how to do that in not so hacky way, but it introduces some overhead. We could build some matrioshka doll.

The main worker process forks a new process (let's call it trusted)
The trusted process forks a new process (let's call it executor)
The executor executes a PVF
The trusted process observes the executor's CPU time through getrusage(RUSAGE_CHILDREN)
The trusted process reports the CPU time to the main process

I believe the only overhead here is the second fork() as the executor process may communicate with the main process directly. I believe that overhead could be significantly reduced if we could use clone() instead of fork() in one of those bifurcation points.

Just an idea and should be evaluated.

mrcnski · 2023-09-12T18:44:35Z

Interesting idea! I think it has the issue noted above, where we can't get the CPU time with getrusage(RUSAGE_CHILDREN) until the executor process finishes. This is not ideal because there could be some bug or attack where we have some over-running jobs. (We do have another timeout on the host for these situations so the over-running process doesn't take forever. But it's wall-clock timeout (and not CPU time) and super lenient. I'd rather rely on on a real, precise CPU timeout.)

I approve of thinking out-of-the-box and Russian-doll approaches are always a win in my book. But what's wrong with setrlimit? Seems like it does exactly what we want in a single call.

s0me0ne-unkn0wn · 2023-09-13T09:52:23Z

But what's wrong with setrlimit? Seems like it does exactly what we want in a single call.

Probably nothing's wrong with it. Somehow, I was sure you need root privilege to set the hard limit, but manpage says you can lower the limit without privileges, so it should be okay.

I just tend to trust our resource-limiting code more than OS's code 🙂 Because our code is right before our eyes in the first place. But I understand that we shouldn't invent too many wheels as well.

mrcnski · 2023-09-13T10:21:28Z

Phew! @jpserrat I hope everything is clear now. BTW, I'd be more than happy to provide mentorship. Let me know if you would like to get on a call to discuss more, or do some pair programming. My email's on my GitHub profile.

jpserrat · 2023-09-15T09:21:42Z

That would be great @mrcnski, thanks!

mrcnski · 2023-11-03T15:40:59Z

Here there is a section called "Why not fork?". But I didn't really understand the concerns and am ending for the day.

Ah, I think the concerns are not valid because we are calling fork from a single-threaded context. See also https://docs.rs/nix/latest/nix/unistd/fn.fork.html#safety for more info.

This means, the forked child job process can have threads if it wants. And I think it will be necessary to monitor CPU time again, because setrusage has a little caveat I didn't know about.

We can still use getrusage(CHILDREN) to avoid trusting the time metric reported by the child job process. However, we can no longer rely on the process getting killed when it surpasses the time limit. This means malicious code can take up validator resources up until the (very long) lenient timeout is reached. It's unfortunate, but I don't think it's a security issue. We could leave a followup to investigate e.g. cgroups as an alternative, but it's unlikely that we'd get to it soon.

mrcnski mentioned this issue Aug 24, 2023

PVF: consider spawning a new process per job #584

Closed

Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023

mrcnski mentioned this issue Sep 3, 2023

PVF: more filesystem sandboxing #1373

Merged

1 task

jpserrat mentioned this issue Sep 24, 2023

change prepare worker to use fork instead of threads #1685

Merged

the-right-joyce added this to parachains team board Oct 18, 2023

the-right-joyce moved this to Backlog in parachains team board Oct 18, 2023

mrcnski closed this as completed in #1685 Nov 14, 2023

github-project-automation bot moved this from Todo to Done in PVF Security Hardening Nov 14, 2023

mrcnski moved this from Backlog to Deployed on Polkadot in parachains team board Nov 14, 2023

mrcnski assigned jpserrat Nov 19, 2023

the-right-joyce moved this from Deployed on Polkadot to Completed in parachains team board Nov 29, 2023

bkchr pushed a commit that referenced this issue Apr 10, 2024

Bump async-std from 1.7.0 to 1.8.0 (#574)

5a790c9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PVF worker: consider fork instead of thread for better isolation #574

PVF worker: consider fork instead of thread for better isolation #574

mrcnski commented Aug 23, 2023

eskimor commented Aug 23, 2023

mrcnski commented Aug 28, 2023

jpserrat commented Aug 29, 2023

mrcnski commented Aug 29, 2023

mrcnski commented Aug 31, 2023

jpserrat commented Sep 1, 2023

mrcnski commented Sep 4, 2023 •

edited

Loading

mrcnski commented Sep 5, 2023

jpserrat commented Sep 7, 2023

mrcnski commented Sep 7, 2023

jpserrat commented Sep 9, 2023

mrcnski commented Sep 10, 2023

s0me0ne-unkn0wn commented Sep 10, 2023

mrcnski commented Sep 10, 2023 •

edited

Loading

s0me0ne-unkn0wn commented Sep 10, 2023

jpserrat commented Sep 12, 2023

mrcnski commented Sep 12, 2023

s0me0ne-unkn0wn commented Sep 12, 2023

mrcnski commented Sep 12, 2023

s0me0ne-unkn0wn commented Sep 12, 2023

mrcnski commented Sep 12, 2023 •

edited

Loading

s0me0ne-unkn0wn commented Sep 13, 2023

mrcnski commented Sep 13, 2023

jpserrat commented Sep 15, 2023

mrcnski commented Nov 3, 2023

PVF worker: consider fork instead of thread for better isolation #574

PVF worker: consider fork instead of thread for better isolation #574

Comments

mrcnski commented Aug 23, 2023

eskimor commented Aug 23, 2023

mrcnski commented Aug 28, 2023

jpserrat commented Aug 29, 2023

mrcnski commented Aug 29, 2023

mrcnski commented Aug 31, 2023

jpserrat commented Sep 1, 2023

mrcnski commented Sep 4, 2023 • edited Loading

mrcnski commented Sep 5, 2023

jpserrat commented Sep 7, 2023

mrcnski commented Sep 7, 2023

jpserrat commented Sep 9, 2023

mrcnski commented Sep 10, 2023

s0me0ne-unkn0wn commented Sep 10, 2023

mrcnski commented Sep 10, 2023 • edited Loading

s0me0ne-unkn0wn commented Sep 10, 2023

jpserrat commented Sep 12, 2023

mrcnski commented Sep 12, 2023

s0me0ne-unkn0wn commented Sep 12, 2023

mrcnski commented Sep 12, 2023

s0me0ne-unkn0wn commented Sep 12, 2023

mrcnski commented Sep 12, 2023 • edited Loading

s0me0ne-unkn0wn commented Sep 13, 2023

mrcnski commented Sep 13, 2023

jpserrat commented Sep 15, 2023

mrcnski commented Nov 3, 2023

mrcnski commented Sep 4, 2023 •

edited

Loading

mrcnski commented Sep 10, 2023 •

edited

Loading

mrcnski commented Sep 12, 2023 •

edited

Loading